The Semantic Scaffold: Functional Dissociation of Visual and Language-derived Features Shapes Human Natural Scene Understanding | Research Square

Yu Zhang, Yuxuan Tu, Zihan Yin, Jing Zhang, Weiyang Shi, Siyang Li, and 3 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8259624/v1 This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1 posted.

Abstract

Natural scene understanding requires the seamless integration of high-resolution sensory inputs with abstract conceptual knowledge. Conventional computational models often treat scene comprehension as a feed-forward, visual-centric process. Here, we challenge this view by proposing the Semantic Scaffold framework, positing that language-derived semantic knowledge acts as a foundational component that actively shapes visual perception.
To test this, we leveraged unimodal (visual-only, language-only) and multimodal (visual-language) encoding models as computational probes on the massive 7T fMRI Natural Scenes Dataset (NSD) to systematically dissect the functional topography of the human cortex. We reveal a fundamental cortical dissociation: perceptually-driven visual features are confined to the visual cortex, whereas language-derived features robustly predict activity across expansive frontal and temporal association cortices. Crucially, multimodal integration is necessary to model neural activity at the interface of these systems, providing empirical support for an integrated mechanism where top-down semantic knowledge contextually modulates visual input. Furthermore, we characterize the internal structure of this semantic scaffold, revealing a unified atlas organized along a dominant animate-inanimate axis with robust left-hemisphere lateralization. Our study repositions language-derived knowledge from a secondary consequence to a primary cognitive scaffold, advancing an integrated mechanistic understanding of how the human brain constructs a coherent perception of the world.

Subjects: Biological sciences/Neuroscience/Computational neuroscience/Neural encoding; Biological sciences/Neuroscience/Cognitive neuroscience/Perception; Biological sciences/Neuroscience/Cognitive neuroscience/Language

Keywords: Encoding model; Natural scene understanding; Multimodal Integration; Vision-language models; Top-Down Modulation; fMRI

Highlights

Multimodal AI models reveal a clear functional dissociation between sensory-driven visual processing and a language-derived semantic pathway. This establishes a semantic encoding system in frontal and temporal association cortices that is distinct from feed-forward, bottom-up visual processing.
Multimodal integration of visual and semantic features yields widespread superior predictions relative to their unimodal counterparts, validating the integration component of the Semantic Scaffold framework. This semantic system is organized as a unified atlas, characterized by a dominant animate-vs-inanimate gradient and a robust left-hemisphere lateralization.

Significance Statement

How the human brain understands a complex visual scene is traditionally modeled as a feed-forward, visual-centric process. This work challenges this traditional view, proposing a Semantic Scaffold framework where language-derived knowledge is a foundational component in natural-scene understanding that actively shapes visual perception. Using high-resolution 7T fMRI and advanced computational models, we establish two core components of this framework: 1) a functional dissociation between visual pathways and a distinct semantic pathway in the frontal and temporal lobes, and 2) an integration process whereby these pathways converge to form a unified perception. This work repositions language-derived knowledge as a primary, active component in how humans build a coherent perception of the world.

Introduction

Deciphering how the human brain constructs a coherent interpretation of the visual world remains a fundamental challenge in neuroscience, requiring the flexible association of high-resolution sensory input with abstract conceptual knowledge. Classically, the neural basis of object recognition and scene understanding has been modeled as a hierarchical, feed-forward process, extending from early sensory areas (V1) through the ventral visual stream to high-level temporal cortices (IT/VTC) (DiCarlo et al., 2012; Yamins and DiCarlo, 2016). In this view, semantic interpretation is commonly regarded as an emergent property derived primarily from the bottom-up processing of visual features.
This visual-centric paradigm has been reinforced by recent advances in neuro-AI, where deep neural networks trained solely on visual tasks (e.g., AlexNet, ResNet) have proven remarkably effective at predicting neural activity across the primate and human visual system (Bonnen et al., 2021; Güçlü and Gerven, 2015; Horikawa and Kamitani, 2017; Schrimpf et al., 2020; Yamins et al., 2014). These vision-based models, however, excel primarily at explaining neural dynamics related to bottom-up sensory processing while often neglecting the profound influence of top-down factors and semantic context, which are central to theories of predictive processing (Millidge et al., 2022; Salvatori et al., 2021). Accumulating evidence suggests that real-world scene comprehension is rarely a purely bottom-up visual exercise. Instead, it relies heavily on top-down semantic grounding, which employs abstract conceptual knowledge to resolve perceptual ambiguity and structure our understanding of the environment (Bi, 2021; Lupyan et al., 2020). This reliance on conceptual knowledge implies that the neural architecture of scene understanding involves the interplay and convergence of sensory-derived representations and abstract, language-derived knowledge. Prior work has successfully mapped large-scale semantic spaces using narrative speech or isolated linguistic stimuli (Huth et al., 2016, 2012; Mitchell et al., 2008; Popham et al., 2021). More recently, several studies have demonstrated that multimodal deep learning models (combining vision and text) predict brain responses more accurately than unimodal models (Bonner and Epstein, 2021; Doerig et al., 2025). While these studies have established the computational utility of multimodal models, the biological architecture underlying this improvement remains unresolved.
It remains unclear whether language-derived semantic information is simply integrated into the visual stream itself, or whether it reflects the recruitment of a distinct, anatomically dissociable pathway that creates a conceptual scaffold for vision. Furthermore, the precise functional topography of where perceptually-driven visual features dissociate from abstract semantic features has not been definitively mapped during naturalistic processing. To address this critical gap, we propose the Semantic Scaffold framework, positing that human scene understanding relies on two core components: 1) a functional dissociation between two representational streams, consisting of a bottom-up visual pathway and a top-down, language-derived pathway that provides semantic context; 2) an integration process whereby the language-derived pathway provides a contextual scaffold that actively shapes coherent perception. To test this hypothesis, we leveraged a suite of unimodal (visual-only, language-only) and multimodal (visual-language) deep learning architectures as computational probes against the massive 7T fMRI Natural Scenes Dataset (NSD; Allen et al., 2021) to systematically dissect the unique cortical contributions of visual- versus language-derived representations. Our analysis reveals three key findings that advance the mechanistic understanding of natural-scene processing. First, we identify a functional dissociation between the two processing streams. While perceptually-driven visual features are strictly confined to the visual cortex, language-derived semantic features robustly predict activity across expansive frontal and temporal association cortices, independent of visual complexity. Second, we demonstrate that multimodal integration is critical for modeling activity at the interface of the two systems, providing empirical support for a multimodal mechanism where top-down semantic knowledge contextually modulates visual input.
Finally, we resolve the internal structure of this semantic scaffold, revealing a unified atlas organized around a dominant animate-inanimate axis with robust left-hemisphere lateralization. Collectively, these findings reposition language-derived knowledge from a secondary consequence of vision to a foundational, active scaffold that shapes human experience of the visual world.

Results

Visual encoding models map the cortical hierarchy of the visual system

We first validated our encoding framework by demonstrating its ability to recapitulate the well-established cortical hierarchy of the human visual system. As expected, vision-based encoding models using convolutional neural networks (CNN; e.g., ResNet-50) and vision transformers (ViT) successfully predicted neural activity across the entire visual cortex. In addition, we observed a clear hierarchical progression in CNN-based encoding maps: shallower layers best predicted activity in early visual areas (V1-V3), while deeper layers extended to higher-order regions in the ventral (“what”) and dorsal (“where”) pathways. Using latent features from the four residual blocks of ResNet-50, we found a clear functional specialization progressing along the ventral and dorsal pathways (Fig. 2 A). Specifically, the earliest layer (Block 1) selectively predicted activity in early retinotopic areas (V1-V3), reflecting the encoding of low-level features like edges and contrast. The intermediate layers (Blocks 2 and 3) extended the predictive power to higher-order regions in both ventral (e.g., LOC, pFs) and dorsal (e.g., V3A/B, IPS) streams, indicating increasing selectivity for intermediate shape and spatial geometry. Notably, Block 3 additionally engaged scene- and motion-selective periphery (PPA, OPA, hMT+), consistent with its role in developing abstract, position-tolerant representations for recognizing complex patterns and moving objects.
The deepest layer (Block 4) showed minimal correspondence with early visual cortex, instead achieving significant predictions in higher-order regions dedicated to visuospatial and motion processing (e.g., hMT+, posterior IPS), representing global attributes of the scene like spatial location and dynamic motion. This functional specialization confirmed that CNN-based encoding models mirror the cortical hierarchy of visual processing, progressing from low-level features and shape geometry to abstract representations of complex patterns and object interaction. A direct comparison of CNN-based and transformer-based architectures revealed complementary, anatomically specific advantages (Fig. 2 B). Compared to ResNet, ViT yielded significantly higher prediction accuracy in early visual cortex (V1-V3) and along the ventral “what” pathway (e.g., LOC, FFA, PPA), suggesting that ViT’s self-attention mechanisms effectively captured the fine-grained, holistic details required for robust object recognition. In contrast, ResNet excelled along the dorsal “where” pathway (e.g., V3A/B, IPS, MT+), consistent with the established benefit of using translation-equivariant kernels for processing information related to spatial location and motion. Furthermore, by mapping these vertex-wise prediction maps onto individual visual areas defined by the Kastner atlas (Wang et al., 2015a), we confirmed a strict hierarchical correspondence among vision models: low-level CNN layers predicted activity in V1-V3 (edges and contours), intermediate layers predicted V3A/B and VO1/2 (object shape and boundaries), and the deepest layers predicted hMT+ (motion and direction), while ViT encoded both low-level details in V1-V3 and abstract representations in the ventral pathway (Fig. 2 B). When mapping these predictions onto large-scale functional networks (Fig. 2 C), we found that ViT’s advantage extended beyond the visual cortex and into the dorsal and ventral attention networks.
This finding underscores the benefit of a global self-attention architecture for modeling both sensory processing and associated attention-related cortical responses.

Language models reveal a semantic system beyond the visual cortex

We hypothesized that scene representations derived from linguistic and semantic embeddings would encode neural activity extending beyond the boundaries of the classical visual cortex. Using an 80-dimensional multi-hot vector of COCO “thing” labels, we observed significant predictions (FDR corrected, p < 0.01) across extensive frontal, parietal and temporal association regions (Fig. 4 A). Critically, this simple multi-hot label vector surpassed the best vision-based model (ViT) in prediction accuracy across widely distributed functional areas, including high-order visual areas and the frontoparietal, dorsal-attention and default-mode networks (Fig. 4 C). Substituting the multi-hot labels with semantic embeddings extracted from BERT (text only, using the five-sentence captions of each scene provided by the COCO dataset) yielded a similar, broadly distributed cortical map (Fig. 3 A). While these two types of models exhibited comparable encoding accuracy in high-order visual cortex, BERT achieved higher peak correlations than MultiHot in anterior temporal and inferior frontal regions, suggesting that BERT extracts more abstract semantic features that are better suited for comprehensive scene understanding. Intriguingly, despite its lower complexity and fewer parameters, the MultiHot encoding model proved highly competitive: its median prediction accuracy (r) was only 3% lower than that of BERT, and it produced only 1.7% fewer significant predictions (FDR corrected, p < 0.01). Furthermore, both types of language-based encoding models exhibited similar predictive profiles along the dorsal and ventral streams (Figure S2).
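The 80-dimensional multi-hot representation described above can be illustrated with a minimal sketch. The function name `multi_hot` and the five-entry vocabulary `VOCAB` are hypothetical stand-ins (the study uses the 80 COCO “thing” categories), and how repeated objects were handled is not stated here, so this sketch assumes simple presence/absence coding.

```python
from typing import Iterable, List, Sequence


def multi_hot(labels_in_image: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with a 1 for every vocabulary category present in the image."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for label in labels_in_image:
        if label in index:  # labels outside the vocabulary are ignored
            vector[index[label]] = 1  # presence only; repeated objects do not add up
    return vector


# Hypothetical 5-category vocabulary standing in for the 80 COCO "thing" labels.
VOCAB = ["person", "dog", "bicycle", "car", "pizza"]

# A scene annotated with two people and one dog.
example = multi_hot(["dog", "person", "person"], VOCAB)
print(example)
```

Each scene is thus reduced to a fixed-length binary vector that carries no pixel information, which is what makes its competitiveness with BERT in the results above notable.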
These results collectively demonstrate that language-derived representations robustly encode neural responses in the association cortex, consistently outperforming purely visual representations and extending their predictive power beyond higher-order visual areas into frontal, parietal, and temporal regions (Fig. 4 A). This finding establishes the existence of a widespread cortical semantic system dedicated to representing abstract semantic knowledge for scene understanding, which is functionally distinct from visual appearance and extends far beyond the traditional boundaries of the visual cortex. We next investigated the effect of richer linguistic context on encoding performance by applying the BERT model to text captions of varying length: the shortest caption (8.51 ± 0.81 words), the medium-length caption (10.12 ± 1.21 words), the longest caption (13.19 ± 3.11 words), and the full 5-sentence captions. The resulting semantic embeddings were used, separately for each caption type, to predict the vertex-wise fMRI responses to each scene. The overall whole-brain encoding maps remained highly consistent in their spatial distributions across different caption lengths (Figure S4 and S5), extending broadly from the visual cortex to the frontal, parietal and temporal lobes. However, a detailed analysis of the top 1% best-predicted cortical vertices (approximately 700 vertices per hemisphere) revealed a clear dependence on linguistic context: both the median and the maximum prediction accuracy (r values) increased monotonically with caption length (Fig. 3 B). In particular, BERT with the full 5-sentence captions achieved the highest encoding performance among language-based models, reaching a maximum encoding r = 0.714 (Fig. 3 A). The most accurately predicted areas included those engaged in the recognition of motion (e.g., hMT/MST), faces (e.g., FFA) and scenes (e.g., PPA, RSC).
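As an illustration of the top-1% analysis above, the following sketch selects the best-predicted fraction of vertices from a list of per-vertex correlations and reports their median and maximum. The function name `top_fraction_summary` and the toy data are assumptions for illustration; the exact selection procedure used in the study is not specified in this section.

```python
from statistics import median
from typing import Dict, Sequence


def top_fraction_summary(r_values: Sequence[float], fraction: float = 0.01) -> Dict[str, float]:
    """Summarize the best-predicted vertices: count, median r, and maximum r."""
    if not r_values:
        raise ValueError("r_values must not be empty")
    # Keep at least one vertex even for very small inputs.
    k = max(1, round(len(r_values) * fraction))
    top = sorted(r_values, reverse=True)[:k]
    return {"n": k, "median": median(top), "max": top[0]}


# Toy data: 1000 hypothetical vertex-wise correlations from 0.000 to 0.999.
toy_r = [i / 1000 for i in range(1000)]
summary = top_fraction_summary(toy_r)
print(summary)
```

Running the same summary once per caption length would give the median and maximum values whose monotonic increase is reported in Fig. 3 B.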
This finding demonstrates that richer linguistic context and additional semantic details significantly enhance the prediction of brain activity in association areas, even in the absence of visual information. Our analysis revealed a significant hemispheric lateralization effect among language-based encoding models. Specifically, both MultiHot and BERT-based encoding models exhibited a clear left-hemisphere dominance, with median r values 0.06 to 0.08 higher in the left hemisphere than in the right (Fig. 5 D). This pattern was consistently observed across all caption lengths (Figure S5). In contrast, vision-based encoding models (ResNet and ViT) exhibited a bilaterally distributed pattern, suggesting equal contributions from both hemispheres to visual perception. Among all encoding models, BERT exhibited the strongest left-hemisphere dominance, with a median-correlation laterality index (LI) of 0.073, followed by the multimodal model (0.067) and MultiHot (0.06), while vision-based models displayed weak or even right-hemisphere lateralization (ViT: LI = -0.04; ResNet: LI = -0.002). The left-hemisphere lateralization of language models was consistently observed across participants, caption lengths (single sentence vs. five-sentence) and text formats (object-label vectors vs. full-sentence captions) (Figure S5). These findings align with the typical dominance of the left hemisphere for language and semantic processing, contrasting with the bilateral distribution for sensory processing. This suggests that semantic embeddings play a crucial role in the reconstruction of rich linguistic and contextual information for comprehensive natural-scene understanding.

Multimodal Fusion Outperforms Unimodal Models Across the Brain

Vision and language models each captured unique components of neural activity during natural-scene viewing.
Visual features extracted from CNN and ViT dominated early visual areas (V1-V3), while semantic embeddings from MultiHot and BERT preferentially explained activity in association areas of the frontoparietal and dorsal/ventral attention networks. We therefore hypothesized that the joint embedding of visual and linguistic information would amplify these complementary signals and yield the best prediction of neural activity across the whole brain. To test this, we implemented a multimodal encoding model based on ViLT, which was pretrained to align image patches, object labels and text captions. As shown in Fig. 4 A and Table S1, the ViLT-based encoding model significantly outperformed every unimodal model, with its median prediction accuracy rising by 3% over BERT, 8% over MultiHot, 28% over ViT, and 30% over ResNet. Crucially, the gain relative to vision-based models was markedly stronger in the left hemisphere (ViLT - ViT: 33% left vs 23% right; ViLT - ResNet: 34% left vs 26% right), whereas the gain relative to language models was comparable across hemispheres (ViLT - BERT: 2.7% left vs 3.4% right; ViLT - MultiHot: 8.4% left vs 7.1% right). The significantly predicted cortical vertices (FDR corrected, p < 0.01) extended from the early visual cortex along both dorsal and ventral pathways into perisylvian language areas and frontoparietal control networks (e.g., LOC, FFA, PPA, PHC, hMT+, IPS, SPS). This finding indicates that the cross-modal interaction of visual and linguistic information boosts brain encoding beyond a simple additive benefit of vision-plus-language alone. For instance, ViLT outperformed BERT in high-order visual cortex (V3A/B, hMT+) and surpassed ViT throughout dorsal and ventral streams as well as in frontoparietal and attention networks (Fig. 5 A,B).
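The significance maps in this section are thresholded with an FDR correction at p < 0.01. The text does not name the specific procedure, so the sketch below assumes the widely used Benjamini-Hochberg step-up method; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.01) -> List[bool]:
    """Return a significance flag per test under Benjamini-Hochberg FDR control at level q."""
    m = len(p_values)
    # Indices of the tests sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values for four cortical vertices.
flags = benjamini_hochberg([0.5, 0.001, 0.03, 0.004], q=0.01)
print(flags)
```

In a vertex-wise analysis the input would be one p-value per cortical vertex, and the returned flags would define the mask of significantly predicted vertices.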
The best-predicted sites were found in the lateral temporal cortex (TO) and parahippocampal cortex (PHC), highlighting the behavioral relevance of multimodal representations for object recognition and scene understanding. Difference maps of multimodal and unimodal encoding models (Fig. 5 A) further dissected the unique contribution of each modality. The visual component (e.g., ViLT - BERT) revealed positive clusters in hMT, MST and OFA, indicating the enhancement of low-level visual details beyond abstract representations derived from language alone. Conversely, the semantic component (e.g., ViLT - ViT) demonstrated widespread gains in frontal and parietal regions, reflecting a global amplification of semantic information in neural encoding over purely visual, low-level image features. Regional encoding performance of individual visual areas using the Kastner atlas confirmed these patterns (Fig. 5 B). While ViT’s dominance was confined to early visual cortex (V1v, V1d and V2v), language-based models (BERT and MultiHot) dominated the ventral and dorsal visual pathways. Critically, the ViLT model combined these advantages by matching ViT in early visual cortex while outperforming all other models elsewhere. Mapping these results onto the Yeo-7 functional networks revealed a clear functional gradient in encoding accuracy (Fig. 4 C), which decayed across brain networks from visual to dorsal-attention, frontoparietal, and default-mode networks. Crucially, within each association network, the language-based and multimodal models systematically outperformed vision-only models. This demonstrates that natural-scene understanding consistently recruits abstract, language-like representations throughout the cortex that supersede purely visual coding.
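The laterality indices (LI) of median correlation reported in this section can be sketched as follows. The exact LI formula is not given in this part of the text, so the code is an assumption: by default it uses the common normalized form (L - R) / (L + R) applied to hemisphere-wise median r values, and it can alternatively return the raw left-minus-right difference. The function name `laterality_index` and the example values are hypothetical.

```python
from statistics import median
from typing import Sequence


def laterality_index(left_r: Sequence[float], right_r: Sequence[float],
                     normalize: bool = True) -> float:
    """Laterality of median prediction accuracy; positive values mean left-hemisphere dominance.

    Assumption: the normalized form (L - R) / (L + R) is a common convention,
    not necessarily the definition used in the study. It is only meaningful
    when both medians are positive.
    """
    left_med = median(left_r)
    right_med = median(right_r)
    diff = left_med - right_med
    if not normalize:
        return diff  # raw difference of hemisphere medians
    total = left_med + right_med
    if total == 0:
        raise ValueError("sum of hemisphere medians is zero; normalized LI is undefined")
    return diff / total


# Hypothetical vertex-wise correlations for each hemisphere.
li = laterality_index([0.30, 0.32, 0.28], [0.10, 0.12, 0.08])
print(li)
```

Under either convention, a value near zero corresponds to the bilateral pattern described for the vision-based models, and a positive value to the left-hemisphere dominance described for the language-based and multimodal models.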
Finally, the ViLT model also exhibited a left-hemisphere lateralization effect (LI of median correlation: 0.067), similar to the language-based models (BERT: 0.074; MultiHot: 0.060), in contrast to the bilaterally distributed patterns in vision-based models (Figure S5). The combined evidence underscores the essential role of incorporating both semantic and visual features for an accurate account of human scene understanding.

Inter-subject variability is driven by language, not vision

To quantify how scene-comprehension strategies differ across individuals, we trained vertex-wise encoding models separately for each participant and subsequently computed the inter-subject variance of prediction accuracy. The joint image-text embeddings in ViLT consistently yielded high encoding accuracy in the lateral occipital and temporal cortex, as well as in frontoparietal regions (Fig. 7). Crucially, the variability maps revealed a clear spatial dissociation: inter-subject variance was highest in the frontoparietal network and ventral visual pathway (e.g., V4, hMT, MST, FFA, PPA, IPS), while early visual areas (V1-V3) showed relatively low variability. This topography was prominent for both the multimodal (ViLT) and language-based (BERT) models, but was notably absent for the vision-based (ViT) model, which exhibited low variability across the entire cortex, with minor peaks in visual areas and the ventral pathway (Fig. 6 A). Further correlation analysis of the variability maps demonstrated that the inter-subject variability of multimodal encoding (ViLT) was almost perfectly predicted by that of the unimodal language-based model (ViLT vs BERT: r = 0.97), but only moderately correlated with that of the vision-only model (ViLT vs ViT: r = 0.74). Scatter plots of vertex-wise variance maps confirmed this visual-linguistic dissociation (Fig.
6 B) such that vertices of BERT-vs-ViLT clustered tightly along the identity line, whereas vertices of ViLT-vs-ViT and ViT-vs-BERT were more broadly distributed, exhibiting a zero-shift at high-variance vertices, especially for vertices showing low accuracy in BERT but relatively high accuracy in ViT. These findings suggest that inter-subject differences in natural-scene understanding are primarily driven by variability in language-mediated, high-level semantic processing, rather than by differences in low-level visual perception.

Semantic cortical atlas of objects and concepts

The semantic encoding models not only demonstrated high predictive power across the entire cortex but also offered strong interpretability and behavioral relevance through semantic mapping. To create a unified semantic cortical atlas of objects and concepts, we trained a single MultiHot encoding model using group-concatenated fMRI data from all participants. The resulting group-level weight matrix (80 COCO “thing” categories by 163k cortical vertices) was then projected onto low-dimensional semantic axes using principal component analysis (PCA). The first three principal components (PCs), which collectively explained 60% of the total variance, served as a unified color palette for the semantic mapping (Fig. 8 A). By projecting individual participants’ cortical vertices onto these semantic axes, we generated a continuous semantic map for each brain (Fig. 8 B,C), with each PC capturing a distinct semantic gradient. Specifically, PC1 (regions in blue) strongly correlated with the animate-vs-inanimate distinction and clustered within the early visual cortex (V1-V3). PC2 (regions in green) encoded action-related and movable objects, primarily located in areas of the dorsal pathway including the intraparietal sulcus (IPS), superior parietal lobule (SPL) and hippocampus.
PC3 (regions in purple) indexed concepts such as person, food, and man-made tools, occupying areas in the ventral pathway including FFA and OFA, and extending into prefrontal cortex. This demonstrates that the low-dimensional semantic embedding of objects and concepts successfully recovered behaviorally meaningful semantic gradients that respect the functional anatomy of the human visual system.

Discussion

In this study, we leveraged a suite of unimodal (visual-only, language-only) and multimodal (visual-language) deep learning architectures as computational probes to test the Semantic Scaffold hypothesis for human natural-scene understanding. Our principal findings demonstrate that comprehensive scene comprehension critically relies on abstract, language-derived semantic knowledge that extends far beyond the classical visual cortex, challenging the traditional view of semantic processing as a purely downstream visual consequence. We provide strong evidence for the two core components of our framework. First, supporting the dissociation component, while perceptually-driven representations remained largely confined to the visual cortex, language-derived representations robustly predicted activity across expansive association networks. Second, supporting the integration process, we found that multimodal joint embeddings systematically outperformed visual-only models across expansive frontal and temporal association cortices. This superior performance underscores the dynamic interplay between bottom-up sensory processing and top-down semantic grounding, suggesting that visual input is continuously interpreted through existing semantic knowledge structures. Further gradient analysis revealed a unified semantic atlas organized along the animate-inanimate axis, and a reliable left-hemisphere lateralization for high-level semantic integration.
Collectively, our study repositions language-derived semantic knowledge as a primary, foundational component organizing cortical representations, advancing a computational roadmap for disentangling the roles of vision and language in shaping the functional topography of the human brain.

Semantic scaffolding: dissociation of visual and abstract semantic features

Our findings provide strong support for the dissociation component of our Semantic Scaffold hypothesis, suggesting a clear functional separation between sensory- and language-derived features. First, we established a baseline by confirming that visual-only encoding models (e.g., ResNet, ViT) replicate foundational neuro-AI findings (Bonnen et al., 2021; Güçlü and Gerven, 2015; Horikawa and Kamitani, 2017; Schrimpf et al., 2020; Yamins et al., 2014). In line with these classic studies, we found that perceptually-driven representations remained largely confined to the visual cortex, supporting the well-established cortical hierarchy for visual processing (V1 to VTC). Neuroscientists initially employed brain encoding models to determine which AI model is most “brain-like” (Schrimpf et al., 2020). Following this foundational work (Schrimpf et al., 2020; Yamins and DiCarlo, 2016), studies employed vision models like LeNet, AlexNet, and VGGnet to accurately predict fMRI activity and reconstruct visual stimuli in areas V1 through IT (Güçlü and Gerven, 2015; Horikawa and Kamitani, 2017; Nishimoto et al., 2011; Shen et al., 2019). Our results are consistent with this classic view of the human visual system, including the cortical hierarchy for low-level visual processing (V1 to V4/LO). In sharp contrast, both language-based models (e.g., MultiHot, BERT) and multimodal joint embeddings (ViLT) substantially outperformed visual-only models across extensive frontal and temporal association cortices.
This dichotomy strongly supports the evolving perspective that scene interpretation relies not solely on sensory perception, but also on abstract, language-derived representations (Bi, 2021; Lupyan et al., 2020). Our results establish a distinct semantic encoding pathway in frontal and temporal lobes, robustly dissociable from purely visual perception. These association cortices, crucial for conceptual retrieval and knowledge representation, show a profound preference for abstract semantic features. This semantic knowledge structure, effectively captured both by the complex contextual embeddings of BERT and by the minimal categorical tags of MultiHot vectors, enables these encoding models to maintain their predictive power where purely image-based features lose efficacy. This finding resonates deeply with seminal work demonstrating that concepts are represented in areas far beyond the visual cortex, often overlapping with neural representations of auditory and linguistic stimuli (LeBel et al., 2021; Nishida and Nishimoto, 2018; Popham et al., 2021). Together, our findings establish that this distinct, high-level semantic pathway acts as a crucial cognitive scaffold for sensory perception.

Multimodal integration and top-down semantic guidance

The superior performance of the multimodal ViLT model relative to its unimodal counterparts provides direct empirical validation for the integration component of our Semantic Scaffold hypothesis. This finding confirms that multimodal fusion offers distinct advantages for the accurate mapping of complex brain activity, aligning with advances in AI vision-language transformers like VisualBERT (Li et al., 2019), CLIP (Radford et al., 2021), and ViLT (Dosovitskiy et al., 2020). Our results go a step further, suggesting a plausible neural mechanism where top-down semantic knowledge contextually modulates and refines the interpretation of incoming visual information.
By fusing visual patches with semantic tokens, ViLT captures the dynamic interplay between bottom-up sensory perception and top-down cognitive influence. The superior performance of this integrated approach suggests that comprehensive scene understanding inherently requires abstract, language-derived knowledge to contextually modulate and refine the interpretation of incoming visual input, in line with Predictive Coding theories (Millidge et al., 2022; Salvatori et al., 2021). This crucial role positions top-down semantic guidance not as merely supplementary, but as a critical and continuous factor in achieving robust unified perception. These findings, therefore, provide empirical support that the brain relies on a similar, integrated computational mechanism, as conceptualized by our scaffold framework.

The nature of the Semantic Scaffold: unified semantic atlas and left-hemisphere lateralization

We have established the dissociation and integration components of our Semantic Scaffold hypothesis. Next, we sought to understand its internal structure. Further analysis of the language-based models revealed the content and organization of this semantic scaffolding system. Multi-Hot encoding, derived from remarkably simple categorical tags, robustly recovered the principal axes of semantic organization, defining a dominant animate-vs-inanimate axis, complemented by action-related and man-made dimensions. These semantic gradients directly replicate the core structure that has been consistently recovered in prior semantic mapping studies (Huth et al., 2012, 2016), in which the animate-inanimate axis is widely recognized as the principal dimension of semantic organization. This finding aligns with identical components observed in semantic maps derived from fMRI responses to natural movies (Huth et al., 2016) and the cortical hierarchy of auditory-linguistic atlases (Doerig et al., 2025; Popham et al., 2021; Wang et al., 2023).
Likewise, the action-related dimension aligns with the separation of verbs/actions from nouns/objects, often mapped to the dorsal visual stream involved in motion and manipulation, and the man-made/civilization dimension corresponds to a third major gradient that differentiates tools and manufactured items (Bonner and Epstein, 2021; Mitchell et al., 2008). The efficacy of the Multi-Hot encoding model confirms that, while complex language models like BERT capture fine-grained contextual nuances, the core organizational scaffold of the human semantic network is fundamentally categorical, supporting a robust, unified semantic atlas of the human cortex. Furthermore, this abstract semantic system exhibits a robust functional asymmetry, with notable left-hemisphere lateralization for both the language-based (BERT and MultiHot) and multimodal encoding models. This strong lateralization, consistently observed across participants and invariant to text format and caption length, aligns with the established left-hemisphere dominance for language comprehension and higher-level semantic memory (Fedorenko et al., 2011; Malik-Moraleda et al., 2022). Our findings connect directly to recent work with Large Language Models (LLMs) showing that this asymmetry emerges and strengthens with increasing model complexity (Antonello et al., 2024; Doerig et al., 2025; Grand et al., 2022). Crucially, this dominance holds even when using abstract Multi-Hot categorical tags instead of continuous semantics, suggesting the left hemisphere is intrinsically biased toward high-level conceptual integration, irrespective of the complexity or sensory modality of the input. This left-hemisphere specialization provides a structural basis for the top-down cognitive influence observed in the multimodal model, explaining how semantic features effectively ground the visual input.
Therefore, the left-hemisphere lateralization of semantic encoding is not merely a linguistic artifact, but a fundamental reflection of the brain’s highly contextual and relational semantic structure.

Limitations and Future Directions

The interpretation of this study is subject to several limitations. First, our findings are based on the intensive, deep-sampling, small-cohort design of the NSD dataset. While this approach is powerful for building robust individual-subject models, the small sample size limits the generalizability of our findings and hinders a comprehensive characterization of inter-subject variability in semantic topography. Future work with larger cohorts is necessary to fully validate the precise spatial organization of these semantic maps across the general population. Second, our study used a static design (a large set of isolated images). This approach was sufficient to establish the existence of the Semantic Scaffold framework, as well as its dissociated pathways (visual vs. language) and their eventual integration (the superior performance of ViLT). However, this may limit the overall power to model the dynamic, contextual integration that occurs in the real world. The next crucial step is to extend this analysis to dynamic, naturalistic stimuli (videos with speech/dialog) and use time-resolved encoding models to test how the multimodal advantage shifts when visual and linguistic content is temporally coordinated and causally linked. This dynamic approach is essential for moving from a static map of the scaffold to a full mechanistic model of multimodal semantic binding.

Conclusion

This work challenges the traditional, visual-centric model of scene understanding. We propose the Semantic Scaffold framework, which posits that language-derived knowledge acts as a foundational component for visual perception, not just a downstream consequence. Our results establish the two core components of this framework.
First, a fundamental functional dissociation between perceptually-driven visual representations (confined to the visual cortex) and a distinct, abstract semantic pathway (in frontal and temporal lobes). Second, an integration process, evidenced by the widespread superior performance of multimodal models, demonstrating that these two pathways converge to form a unified, coherent perception. Our findings demonstrate that language-derived semantic knowledge is not a passive, secondary feature, but rather an active scaffold that contextually modulates and refines incoming sensory input. Furthermore, we characterized the nature of this scaffold, revealing a unified semantic atlas organized by a dominant animate-inanimate axis and a robust left-hemisphere lateralization. Collectively, our findings advance an integrated computational mechanism for scene understanding, repositioning language-derived knowledge as a primary component in how humans build a coherent perception of the world.

Materials and Methods

Participants and Dataset

We utilized the publicly available Natural Scenes Dataset (NSD) (Allen et al., 2021), which comprises whole-brain 7T fMRI scans from eight healthy participants. The NSD paradigm comprised up to 40 scan sessions per participant, over which each participant viewed up to 10,000 natural scenes across a one-year period. Among these, a fixed set of 1,000 images was shared across all participants and reserved for model testing. The remaining 9,000 images were unique to each participant and used for training subject-specific vertex-wise encoding models. During scanning, each image was presented for 3 seconds, followed by a 1-second inter-stimulus interval. For the present analysis, we included only the four participants (subj01, subj02, subj05, subj07) who completed all 40 sessions.
All natural scene stimuli were drawn from the Common Objects in Context (COCO) dataset (Lin et al., 2014), a standard benchmark for object detection, instance/semantic segmentation, and key-point estimation. In addition to the scene images, we used the corresponding COCO metadata, specifically the 80 “thing” categories (common objects like person, bicycle, elephant, pizza, etc.) and five human-generated captions (10 to 20 words per caption) of each scene. These captions and object labels served as the linguistic inputs to our language and multimodal encoding models.

fMRI Data Acquisition and Preprocessing

Functional MRI data were acquired at a 7-Tesla field strength with a high spatial resolution of 1.8 mm isotropic voxels. Standard preprocessing steps were performed using the fMRIPrep pipeline v24.0.0 (Esteban et al., 2019), including slice timing correction, head motion correction, co-registration of the functional images to the participant’s T1w image, and nuisance regression of motion parameters, white matter (WM), and cerebrospinal fluid (CSF) signals. The functional images were subsequently normalized to the ICBM152 template by combining the linear functional-to-structural transformation with the nonlinear warping from individual structural space to MNI space. Estimating the brain response to each short-duration (3 s) scene presentation is challenging because of the low signal-to-noise ratio (SNR) and the highly overlapping hemodynamic response function (HRF) effects in rapid event-related designs. We therefore employed GLMsingle (Prince et al., 2022), a specialized toolbox designed to robustly estimate single-trial beta values representing the fMRI response to each scene image. Specifically, we first estimated the optimal voxel-specific HRF from a library of 20 HRF basis functions by selecting, for each voxel, the function that best fit its BOLD signal.
Then, the resulting HRF index map was denoised using single-trial nuisance regression incorporating head motion parameters (Kay et al., 2013). After that, a fractional ridge regression model was applied to dissociate the contribution of each trial to the measured BOLD signals and to estimate the \(\beta\) value of each trial, representing the fMRI response to each scene image. The tradeoff between the regularized and unregularized coefficients in the model was controlled by a fixed ratio \(\gamma\).

$$\widehat{\beta}^{RR} = \underset{\beta}{\arg\min}\left(\left\|y - \left(X \ast f_{hrf}^{\ast}\right)\beta\right\|^{2} + \left\|\beta\right\|^{2}\right)$$

$$\widehat{\beta}^{OLS} = \underset{\beta}{\arg\min}\left(\left\|y - \left(X \ast f_{hrf}^{\ast}\right)\beta\right\|^{2}\right) \tag{1}$$

where \(y\) is the measured BOLD time series; \(X\) is the design matrix of visual stimuli, built from indicator functions that take the value 1 for the duration of a specific event and 0 at all other time points; \(f_{hrf}^{\ast}\) is the voxel-specific HRF selected in the previous step, with which the design matrix is convolved; \(\widehat{\beta}^{RR}\) is the best fit with regularized coefficients; and \(\widehat{\beta}^{OLS}\) is the estimated \(\beta\) without regularization. Then, the actual brain response \(\widehat{\beta}^{\ast}\) was calculated as:

$$\gamma = \frac{\left\|\widehat{\beta}^{RR}\right\|_{2}}{\left\|\widehat{\beta}^{OLS}\right\|_{2}}$$

$$\widehat{\beta}^{\ast} = \underset{\beta}{\arg\min}\left(\left\|y - \left(X \ast f_{hrf}^{\ast}\right)\beta\right\|^{2} + \gamma\left\|\beta\right\|^{2}\right) \tag{2}$$

The resulting voxel-wise maps of \(\widehat{\beta}^{\ast}\) were then projected onto the fsaverage cortical surface template (Fischl et al., 2004), resulting in a vertex-wise brain response map (163,842 vertices per hemisphere) for each scene image. These vertex-wise brain maps served as the target variable for all subsequent encoding model analyses.
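As a rough illustration of Eqs. 1 and 2, the following sketch computes the shrinkage ratio \(\gamma\) and the final estimate for a single voxel. It follows the equations as written in this section and is not the GLMsingle implementation; the function names, the synthetic data, and the assumption that the design matrix has already been convolved with the HRF are all hypothetical choices made for the example.

```python
import numpy as np

def ridge_solution(X, y, penalty):
    """Closed-form ridge estimate: argmin ||y - X b||^2 + penalty * ||b||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

def shrinkage_fraction_fit(X, y):
    """Follow Eqs. 1-2 for one voxel.

    X is assumed (for this sketch) to be the design matrix already
    convolved with the voxel-specific HRF.
    """
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]            # Eq. 1, unregularized
    beta_rr = ridge_solution(X, y, penalty=1.0)                # Eq. 1, regularized
    gamma = np.linalg.norm(beta_rr) / np.linalg.norm(beta_ols)  # ratio of L2 norms
    beta_star = ridge_solution(X, y, penalty=gamma)            # Eq. 2
    return gamma, beta_star

# Synthetic stand-in for one voxel: 200 time points, 20 trials (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
true_beta = rng.standard_normal(20)
y = X @ true_beta + 0.5 * rng.standard_normal(200)

gamma, beta_star = shrinkage_fraction_fit(X, y)
print(f"gamma = {gamma:.3f}")
```

Because ridge shrinkage never increases the norm of the coefficients relative to the least-squares solution, \(\gamma\) computed this way lies between 0 and 1.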
Vertex-wise Encoding Model

We implemented a vertex-based encoding framework to predict single-trial fMRI responses to natural scenes using representations derived from computer vision, language, and multimodal deep-learning architectures. This framework links stimulus features to cortical activity through a two-stage process. First, all types of stimulus content, such as scene images, the corresponding object labels, or text captions, were embedded in a feature vector of latent representations extracted from various computational models (i.e., feature embedding). Second, a subject-specific, vertex-wise ridge regression model was trained to predict the measured single-trial fMRI responses using the extracted latent features (ridge regression).

Feature embedding

For each single-trial stimulus, we located the scene image \(I_i\) and its corresponding text captions \(T_i\) from the COCO dataset, and extracted various types of latent feature vectors in terms of image patches, object categories, text captions, and joint embeddings of both image and text, which are formulated as follows:

$$z_i^{V} = f_{V}\left(I_i\right); \quad z_i^{T} = f_{T}\left(T_i\right); \quad z_i^{\text{MultiHot}} = f_{\text{MultiHot}}\left(T_i\right); \quad z_i^{\text{Joint}} = f_{\text{Joint}}\left(I_i, T_i\right) \tag{3}$$

where \(z_i^{V}\) represents the image feature vector, extracted by applying a computer vision model \(f_{V}\) to the scene image \(I_i\); \(z_i^{T}\) represents the text embedding of the i-th image, extracted by applying a language model \(f_{T}\) to the text captions \(T_i\); \(z_i^{\text{MultiHot}}\) represents the multi-hot embedding of the i-th image, indicating whether a specific object out of the 80 “thing” categories appears in the image; and \(z_i^{\text{Joint}}\) represents the multimodal joint embedding of image and text for the i-th image, generated by a multimodal model \(f_{\text{Joint}}\) that jointly processes the image \(I_i\) and text captions \(T_i\). These four types of latent features, i.e., image features (\(z_i^{V}\)), text features (\(z_i^{T}\)), multi-hot labels (\(z_i^{\text{MultiHot}}\)), and image-text joint embeddings (\(z_i^{\text{Joint}}\)), constituted the input regressors for the subsequent ridge regression models.

Ridge Regression

We used kernel ridge regression to learn a linear mapping \(W\) between the feature embeddings \(Z = \left\{z_i\right\}\) and the observed vertex-wise brain responses \(\beta\). The model was trained by minimizing the following objective function:

$$\mathcal{L}_{W}\left(Z\right) = \left\|\beta - ZW\right\|^{2} + \lambda\left\|W\right\|^{2} \tag{4}$$

where \(Z \in \mathbb{R}^{N \times D}\) represents the feature embedding matrix for \(N\) training stimuli with \(D\)-dimensional features; \(\beta \in \mathbb{R}^{N \times S}\) represents the corresponding brain responses for \(S\) cortical vertices across \(N\) training stimuli; \(W \in \mathbb{R}^{D \times S}\) is the weight matrix of the ridge regression, which projects feature embeddings onto brain responses; \(\lambda\) is the regularization hyperparameter; and \(\beta\) is the surface-projected version of the estimated brain response \(\widehat{\beta}^{\ast}\) in Eq. 2. Subject-specific encoding models were trained separately for each participant and each vertex, using their unique 9,000 training images. The optimal hyperparameter \(\lambda\) was selected via 10-fold cross-validation on the training set.
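The objective in Eq. 4 and the cross-validated choice of \(\lambda\) can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' pipeline: it uses the primal closed-form solution (which gives the same predictions as kernel ridge regression with a linear kernel), a single \(\lambda\) shared across vertices for brevity, and a hypothetical candidate grid and synthetic data.

```python
import numpy as np

def fit_ridge(Z, B, lam):
    """Closed-form minimiser of ||B - Z W||^2 + lam * ||W||^2 (Eq. 4)."""
    n_features = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(n_features), Z.T @ B)

def select_lambda(Z, B, lambdas, n_folds=10, seed=0):
    """Return the candidate lambda with the lowest mean held-out squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(Z.shape[0]), n_folds)
    all_idx = np.arange(Z.shape[0])
    mean_errors = []
    for lam in lambdas:
        fold_errors = []
        for held_out in folds:
            train = np.setdiff1d(all_idx, held_out)
            W = fit_ridge(Z[train], B[train], lam)
            fold_errors.append(np.mean((B[held_out] - Z[held_out] @ W) ** 2))
        mean_errors.append(np.mean(fold_errors))
    return lambdas[int(np.argmin(mean_errors))]

# Synthetic stand-in: N stimuli, D features, S vertices (hypothetical sizes).
rng = np.random.default_rng(1)
N, D, S = 300, 16, 5
Z = rng.standard_normal((N, D))
W_true = rng.standard_normal((D, S))
B = Z @ W_true + rng.standard_normal((N, S))

lambdas = [0.1, 1.0, 10.0, 100.0]   # hypothetical candidate grid
best_lam = select_lambda(Z, B, lambdas)
W_hat = fit_ridge(Z, B, best_lam)
print("selected lambda:", best_lam, "| weight matrix shape:", W_hat.shape)
```

In a per-vertex setting, the same selection loop would be run on each column of the response matrix so that every vertex receives its own \(\lambda\).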
We quantified the performance of each encoding model by computing Pearson’s correlation coefficient (r) between the predicted (\(ZW\)) and observed (\(\beta\)) fMRI responses across the 1,000 test scene images. Additionally, for the multi-hot word embedding \(z_i^{\text{MultiHot}}\), we decomposed the weight matrix \(W_{label}\) into principal components that define the latent semantic gradients of the 80 object-category representations.

Computational Models for Feature Embedding

Our encoding framework employed three distinct families of deep learning architectures, namely vision, language, and multimodal fusion models, to generate feature embeddings for predicting fMRI activity. For vision-based models, we utilized two prominent architectures, ResNet and Vision Transformer (ViT), to extract latent features of natural scenes across different scales and levels of abstraction. The language-based models, Multi-hot Encoding (MultiHot) and Bidirectional Encoder Representations from Transformers (BERT), captured semantic and linguistic representations of the scene images. Finally, a dedicated multimodal fusion approach was implemented via the Vision-and-Language Transformer (ViLT) architecture.

Visual feature embeddings

ResNet (He et al., 2016) is a canonical Convolutional Neural Network (CNN) that uses residual (skip) connections to mitigate vanishing and exploding gradients in deep neural networks. We used an ImageNet-pretrained ResNet-50 architecture and extracted layer activations from its four major residual blocks, yielding a multi-scale hierarchy of visual feature embeddings.
Considering the distinct sizes of activation tensors across residual blocks (Block-1: 56 × 56 × 256; Block-2: 28 × 28 × 512; Block-3: 14 × 14 × 1024; Block-4: 7 × 7 × 2048), we flattened the activation tensor from each residual block and randomly sampled a fixed 100k-dimensional vector \(z_i^{V}\) to obtain a uniform representation for each scene image. Vision Transformer (ViT) (Dosovitskiy et al., 2020) treats an input image as a sequence of fixed-size image patches, linearly embeds each patch, and processes the latent features with a standard Transformer encoder. We used a pretrained ViT-B/16 model and extracted the final-layer patch tokens and the final CLS token, yielding a 768-dimensional feature embedding vector \(z_i^{V}\) for each scene image.

Semantic embedding and language models

Multi-hot Encoding encodes the presence or absence of each of the 80 COCO “thing” categories as an 80-dimensional binary vector \(z_i^{\text{MultiHot}}\), whose entries indicate either presence (1) or absence (0) of the corresponding category. By mapping the continuous image-pixel space onto a discrete object-label space, this embedding yields a low-dimensional abstract semantic description of scene images. This approach has previously been used to model the continuous semantic mapping in the human cortex (Huth et al., 2012). BERT (Devlin et al., 2019) is a bidirectional Transformer encoder pretrained on large text corpora (BooksCorpus and English Wikipedia), yielding context-sensitive embeddings for every language token. For each scene image, we retrieved the five associated caption sentences provided by COCO, pooled them into a single text block, tokenized the block, and extracted the CLS token vector from the pretrained BERT model, resulting in a 768-dimensional linguistic signature \(z_i^{T}\) of each scene image.
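The multi-hot embedding described above is simple enough to write out directly. The sketch below is illustrative only: the helper name `multi_hot` and the four-category toy vocabulary are assumptions made for the example, whereas the study uses the full set of 80 COCO “thing” categories.

```python
import numpy as np

def multi_hot(labels_in_image, category_names):
    """Binary vector with 1 where a category is present in the image, 0 otherwise."""
    index = {name: k for k, name in enumerate(category_names)}
    vec = np.zeros(len(category_names), dtype=np.float32)
    for label in labels_in_image:
        vec[index[label]] = 1.0   # repeated instances still map to a single 1
    return vec

# Toy vocabulary standing in for the 80 COCO categories (hypothetical subset).
categories = ["person", "bicycle", "elephant", "pizza"]
z = multi_hot(["person", "pizza", "person"], categories)
print(z)
```

Note that object counts are discarded: an image with two people and an image with one person receive the same entry for that category.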
To determine how caption length shapes this semantic representation, we repeated the procedure using the shortest, medium-length, or longest single captions, as well as the full five-sentence caption set.

Multimodal joint embedding of image and text

The multimodal feature space was modeled using the Vision-and-Language Transformer (ViLT) (Kim et al., 2021), which uses a single shared Transformer stack to jointly process image patches and caption tokens. The ViLT architecture integrates three components: image-patch embeddings (from a ViT encoder), word embeddings (from a BERT tokenizer), and cross-modal self-attention layers. The model was originally trained on large-scale datasets, including COCO, using three primary objectives: Image-Text Matching (ITM), Masked Language Modeling (MLM), and Word-Patch Alignment (WPA). Image-Text Matching (ITM) aims to align the joint embedding space by distinguishing matched image-text pairs from mismatched (negative) pairs. Image-caption pairs are sampled from the training data, and the negative log-likelihood loss is computed as:

$$\mathcal{L}_{\text{ITM}}\left(\theta\right) = -\mathbb{E}_{\left(I_i, T_j\right) \sim D}\left[y \log s_{\theta}\left(I_i, T_j\right) + \left(1 - y\right) \log\left(1 - s_{\theta}\left(I_i, T_j\right)\right)\right] \tag{5}$$

where \(s_{\theta}\left(I_i, T_j\right)\) indicates the alignment of image and text embeddings, measured by cosine similarity, and \(y \in \{0, 1\}\) indicates whether the sampled image-caption pair \(\left(I_i, T_j\right)\) is matched or not.
Masked language modeling (MLM) is used to predict the masked words \(T_{i,m}\) based on the surrounding context \(T_{i,\backslash m}\) and all image patches \(I_{i,:}\), minimizing the negative log-likelihood:

$$\mathcal{L}_{\text{MLM}}\left(\theta\right) = -\mathbb{E}_{\left(I_i, T_i\right) \sim D} \log P_{\theta}\left(T_{i,m} \mid T_{i,\backslash m}, I_{i,:}\right) \tag{6}$$

where \(\theta\) denotes the trainable parameters. Here, we leveraged the pretrained ViLT-B/16 model to obtain the joint image-text embeddings of natural scenes. Each scene image was paired with its five-sentence text captions and a multi-hot vector of object labels present in the scene. The concatenated inputs were fed into the ViLT model, yielding a 768-dimensional multimodal feature vector from the final CLS token as the joint embedding vector \(z_i^{\text{Joint}}\). To probe how semantic content modulates this joint representation, we repeated the procedure while varying the textual inputs, using the shortest, medium-length, or longest single captions, or the multi-hot object-label vector alone.

Region-of-interest definition

We localized cortical visual areas with the Kastner atlas (Wang et al., 2015b), a probabilistic parcellation map derived from high-resolution 7T fMRI retinotopic, visuotopic and attention-mapping data.
The atlas delineates 25 topographically organized areas, spanning primary visual regions (V1-V3), extrastriate cortex (V3A/B, V4), ventral occipital areas (VO1/2), parahippocampal areas (PHC1/2), lateral occipital areas (LO1/2), temporal occipital areas (TO1/2, encompassing hMT+), intraparietal sulcus areas (IPS0-5), frontal eye field (FEF) and supplementary eye fields (SEF), registered to the fsaverage surface and thresholded at 25%. This yields a set of surface-based probability masks that preserve fine-scale topological boundaries while accounting for inter-subject variability in brain anatomy, enabling region-of-interest analyses precisely aligned to functional visuotopic topography rather than anatomical gyral landmarks.

Evaluation of encoding models

For each cortical vertex, we quantified the prediction accuracy of encoding models using the Pearson correlation coefficient (\(r\)) between predicted and observed brain responses across the held-out test set of 1,000 shared scene images.

$$r_i = \frac{\operatorname{cov}\left(\widehat{\beta}_{:,i}, \beta_{:,i}\right)}{\sigma_{\widehat{\beta}_{:,i}} \, \sigma_{\beta_{:,i}}} \tag{7}$$

where \(\widehat{\beta}_{:,i}\) denotes the brain response at vertex \(i\) predicted from visual or text embeddings, and \(\beta_{:,i}\) denotes the corresponding measured fMRI response provided by the NSD dataset. This vertex-wise correlation coefficient is a standard metric in previous neuro-AI studies for assessing the correspondence between artificial neural networks and biological brain activity (Güçlü and Gerven, 2015; Horikawa and Kamitani, 2017; Schrimpf et al., 2020), with a value approaching 1 indicating a near-perfect prediction of neural activity. To correct for multiple comparisons across the large number of cortical vertices (163,842 vertices in the fsaverage surface), we applied a False Discovery Rate (FDR) correction to the prediction accuracy maps.
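The evaluation step above (Eq. 7 followed by FDR control) might be sketched as follows. Several choices here are assumptions that the text does not specify: the Benjamini-Hochberg procedure is used as the FDR method, p-values are obtained from a Fisher z normal approximation, and the data and function names are synthetic and hypothetical.

```python
import math
import numpy as np

def columnwise_pearson(pred, obs):
    """Eq. 7 evaluated for every vertex (column) at once."""
    pred_c = pred - pred.mean(axis=0)
    obs_c = obs - obs.mean(axis=0)
    num = (pred_c * obs_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (obs_c ** 2).sum(axis=0))
    return num / den

def approx_p_values(r, n):
    """Two-sided p-values via the Fisher z normal approximation (assumed here)."""
    z = np.arctanh(np.clip(r, -0.999999, 0.999999)) * math.sqrt(n - 3)
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

def benjamini_hochberg(p, q=0.01):
    """Boolean mask of tests surviving Benjamini-Hochberg FDR control at level q."""
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    mask = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])   # largest rank meeting the criterion
        mask[order[: cutoff + 1]] = True
    return mask

# Synthetic stand-in: 1,000 test images, 50 vertices (hypothetical size).
rng = np.random.default_rng(2)
n_images, n_vertices = 1000, 50
obs = rng.standard_normal((n_images, n_vertices))
pred = rng.standard_normal((n_images, n_vertices))
pred[:, :10] += obs[:, :10]   # only the first 10 vertices carry real signal

r = columnwise_pearson(pred, obs)
significant = benjamini_hochberg(approx_p_values(r, n_images), q=0.01)
print("vertices retained:", int(significant.sum()))
```

A permutation-based null distribution would be a reasonable alternative to the normal approximation when the test responses are not approximately Gaussian.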
Only vertices with FDR-corrected p < 0.01 were retained as significant predictions in the encoding models. The overall encoding performance, as well as detailed information on the feature embeddings used, is summarized in Table S1. Furthermore, we evaluated the hemispheric lateralization effect for each encoding model by calculating a laterality index \(\text{LI} = (r_L - r_R)/(r_L + r_R)\), where \(r_L\) and \(r_R\) are the mean correlations of the top 1% of vertices in the left and right hemispheres, respectively. The LI ranges from −1 (right dominant) to +1 (left dominant).

Inter-subject variability analysis

To determine whether visual or linguistic factors drive individual differences in the multimodal fusion model, we trained subject-specific encoding models and estimated the inter-subject variability across the four participants. Specifically, the vertex-wise prediction accuracy maps were first Fisher-z-transformed, yielding one z-map per model per subject. We computed the across-subject variance of these z-maps and subsequently correlated the resulting variance maps between the visual, linguistic, and multimodal encoding models. This enables us to quantify the extent to which shared variability in the multimodal model is driven by its constituent visual or linguistic components.

Semantic gradient analysis

The multi-hot object-label encoding model effectively captures the high-level semantic structure of objects and concepts in natural scenes (Huth et al., 2012, 2016). We therefore used it to derive a unified group-level cortical semantic atlas across subjects. First, we concatenated the fMRI data of all participants after projecting their individual brain responses onto the fsaverage5 surface template. A group-level category-by-vertex weight matrix was then trained using kernel ridge regression with 10-fold cross-validation.
Next, we applied Singular Value Decomposition (SVD) to the group-level weight matrix and extracted the first three principal components (PCs), which jointly explained 60% of the total variance. These PCs define orthogonal “semantic gradients” spanning the cortical surface and establish a unified color palette for representing object semantics. Finally, each participant’s individual weight matrix was projected onto these axes by multiplying it with the PC loadings, yielding a continuous semantic map in the cortex for each subject. This approach enables the clear and intuitive visualization of complex semantic relationships of objects and concepts, and facilitates the comparison of the cortical semantic atlas across individual participants.

Declarations

Acknowledgment

This work was partially supported by the STI2030-Major Projects 2021ZD0200200, 2022ZD0211500, the National Natural Science Foundation of China (Grant Nos. 62201519, 52307259, 62327805, 82151307, 82202253).

Author contributions

Conceptualization: YZ; Methodology: YZ; Visualization: ZHY, YXT, YZ; Data analysis: JZ, ZHY, YXT, YZ; Investigation: ZHY, YXT, YZ, JZ, WYY, TQ, SYL; Writing (original draft): TZJ, YFH, JGD, SYL, YZ; Writing (review and editing): TZJ, YFH, JGD, SYL, YZ.

Competing interests

The authors declare no competing financial interests.

Data and Code availability statement

The Natural Scenes Dataset (NSD) and COCO datasets are public and accessible to all researchers. The 7T fMRI dataset from NSD can be accessed via https://naturalscenesdataset.org/. The natural scene images and the corresponding object categories and text captions from COCO can be downloaded from https://cocodataset.org/dataset/home.htm. All unimodal and multimodal encoding models and analysis code will be made available upon request to ensure reproducibility.
References

Allen, E.J., St-Yves, G., Wu, Y., Breedlove, J.L., Prince, J.S., Dowdle, L.T., Nau, M., Caron, B., Pestilli, F., Charest, I., Hutchinson, J.B., Naselaris, T., Kay, K.: A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 1–11 (2021). https://doi.org/10.1038/s41593-021-00962-x
Antonello, R., Vaidya, A., Huth, A.G.: Scaling laws for language encoding models in fMRI. (2024). https://doi.org/10.48550/arXiv.2305.11863
Bi, Y.: Dual coding of knowledge in the human brain. Trends Cogn. Sci. 25, 883–895 (2021). https://doi.org/10.1016/j.tics.2021.07.006
Bonnen, T., Yamins, D.L.K., Wagner, A.D.: When the ventral visual stream is not enough: A deep learning account of medial temporal lobe involvement in perception. Neuron. 109, 2755–2766e6 (2021). https://doi.org/10.1016/j.neuron.2021.06.018
Bonner, M.F., Epstein, R.A.: Object representations in the human brain reflect the co-occurrence statistics of vision and language. Nat. Commun. 12, 4081 (2021). https://doi.org/10.1038/s41467-021-24368-2
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2019). https://doi.org/10.48550/arXiv.1810.04805
DiCarlo, J.J., Zoccolan, D., Rust, N.C.: How Does the Brain Solve Visual Object Recognition? Neuron. 73, 415–434 (2012). https://doi.org/10.1016/j.neuron.2012.01.010
Doerig, A., Kietzmann, T.C., Allen, E., Wu, Y., Naselaris, T., Kay, K., Charest, I.: High-level visual representations in the human brain are aligned with large language models. Nat. Mach. Intell. 7, 1220–1234 (2025). https://doi.org/10.1038/s42256-025-01072-0
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Esteban, O., Markiewicz, C.J., Blair, R.W., Moodie, C.A., Isik, A.I., Erramuzpe, A., Kent, J.D., Goncalves, M., DuPre, E., Snyder, M., Oya, H., Ghosh, S.S., Wright, J., Durnez, J., Poldrack, R.A., Gorgolewski, K.J.: fMRIPrep: a robust preprocessing pipeline for functional MRI. Nat. Methods. 16, 111–116 (2019). https://doi.org/10.1038/s41592-018-0235-4
Fedorenko, E., Behr, M.K., Kanwisher, N.: Functional specificity for high-level linguistic processing in the human brain. Proceedings of the National Academy of Sciences. 108, 16428–16433 (2011). https://doi.org/10.1073/pnas.1112937108
Fischl, B., van der Kouwe, A., Destrieux, C., Halgren, E., Ségonne, F., Salat, D.H., Busa, E., Seidman, L.J., Goldstein, J., Kennedy, D., Caviness, V., Makris, N., Rosen, B., Dale, A.M.: Automatically parcellating the human cerebral cortex. Cereb. Cortex. 14, 11–22 (2004). https://doi.org/10.1093/cercor/bhg087
Grand, G., Blank, I.A., Pereira, F., Fedorenko, E.: Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nat. Hum. Behav. 1–13 (2022). https://doi.org/10.1038/s41562-022-01316-8
Güçlü, U., Gerven, M.A.J., van: Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. J. Neurosci. 35, 10005–10014 (2015). https://doi.org/10.1523/JNEUROSCI.5023-14.2015
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016).
Horikawa, T., Kamitani, Y.: Generic decoding of seen and imagined objects using hierarchical visual features. Nat. Commun. 8, 15037 (2017). https://doi.org/10.1038/ncomms15037
Huth, A.G., de Heer, W.A., Griffiths, T.L., Theunissen, F.E., Gallant, J.L.: Natural speech reveals the semantic maps that tile human cerebral cortex. Nature. 532, 453–458 (2016). https://doi.org/10.1038/nature17637
Huth, A.G., Nishimoto, S., Vu, A.T., Gallant, J.L.: A Continuous Semantic Space Describes the Representation of Thousands of Object and Action Categories across the Human Brain. Neuron. 76, 1210–1224 (2012). https://doi.org/10.1016/j.neuron.2012.10.014
Kay, K., Rokem, A., Winawer, J., Dougherty, R., Wandell, B.: GLMdenoise: a fast, automated technique for denoising task-based fMRI data. Frontiers in Neuroscience, 247 (2013).
Kim, W., Son, B., Kim, I.: ViLT: Vision-and-language transformer without convolution or region supervision, in: International Conference on Machine Learning (ICML). PMLR, pp. 5583–5594 (2021).
LeBel, A., Jain, S., Huth, A.G.: Voxelwise Encoding Models Show That Cerebellar Language Representations Are Highly Conceptual. J. Neurosci. 41, 10341–10355 (2021). https://doi.org/10.1523/JNEUROSCI.0118-21.2021
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., Chang, K.-W.: VisualBERT: A Simple and Performant Baseline for Vision and Language. (2019). https://doi.org/10.48550/arXiv.1908.03557
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014, pp. 740–755. Springer International Publishing, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lupyan, G., Abdel Rahman, R., Boroditsky, L., Clark, A.: Effects of Language on Visual Perception. Trends Cogn. Sci. 24, 930–944 (2020). https://doi.org/10.1016/j.tics.2020.08.005
Malik-Moraleda, S., Ayyash, D., Gallée, J., Affourtit, J., Hoffmann, M., Mineroff, Z., Jouravlev, O., Fedorenko, E.: An investigation across 45 languages and 12 language families reveals a universal language network. Nat. Neurosci. 1–6 (2022). https://doi.org/10.1038/s41593-022-01114-5
Millidge, B., Seth, A., Buckley, C.L.: Predictive Coding: a Theoretical and Experimental Review. (2022). https://doi.org/10.48550/arXiv.2107.12979
Mitchell, T.M., Shinkareva, S.V., Carlson, A., Chang, K.-M., Malave, V.L., Mason, R.A., Just, M.A.: Predicting Human Brain Activity Associated with the Meanings of Nouns. Science. 320, 1191–1195 (2008). https://doi.org/10.1126/science.1152876
Nishida, S., Nishimoto, S.: Decoding naturalistic experiences from human brain activity via distributed representations of words. NeuroImage (New advances in encoding and decoding of brain signals). 180, 232–242 (2018). https://doi.org/10.1016/j.neuroimage.2017.08.017
Nishimoto, S., Vu, A.T., Naselaris, T., Benjamini, Y., Yu, B., Gallant, J.L.: Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. 21, 1641–1646 (2011). https://doi.org/10.1016/j.cub.2011.08.031
Popham, S.F., Huth, A.G., Bilenko, N.Y., Deniz, F., Gao, J.S., Nunez-Elizalde, A.O., Gallant, J.L.: Visual and linguistic semantic representations are aligned at the border of human visual cortex. Nat. Neurosci. 24, 1628–1636 (2021). https://doi.org/10.1038/s41593-021-00921-6
Prince, J.S., Charest, I., Kurzawski, J.W., Pyles, J.A., Tarr, M.J., Kay, K.N.: Improving the accuracy of single-trial fMRI response estimates using GLMsingle. eLife. 11, e77599 (2022).
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. (2021). https://doi.org/10.48550/arXiv.2103.00020
Salvatori, T., Song, Y., Lukasiewicz, T., Bogacz, R., Xu, Z.: Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks. (2021). https://doi.org/10.48550/arXiv.2103.03725
Schrimpf, M., Kubilius, J., Hong, H., Majaj, N.J., Rajalingham, R., Issa, E.B., Kar, K., Bashivan, P., Prescott-Roy, J., Geiger, F., Schmidt, K., Yamins, D.L.K., DiCarlo, J.J.: Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? bioRxiv 407007 (2020). https://doi.org/10.1101/407007
Shen, G., Horikawa, T., Majima, K., Kamitani, Y.: Deep image reconstruction from human brain activity. PLoS Comput. Biol. 15, e1006633 (2019). https://doi.org/10.1371/journal.pcbi.1006633
Wang, A.Y., Kay, K., Naselaris, T., Tarr, M.J., Wehbe, L.: Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat. Mach. Intell. 5, 1415–1426 (2023). https://doi.org/10.1038/s42256-023-00753-y
Wang, L., Mruczek, R.E., Arcaro, M.J., Kastner, S.: Probabilistic maps of visual topography in human cortex. Cereb. Cortex. 25, 3911–3931 (2015a).
Wang, L., Mruczek, R.E.B., Arcaro, M.J., Kastner, S.: Probabilistic Maps of Visual Topography in Human Cortex. Cereb. Cortex. 25, 3911–3931 (2015b). https://doi.org/10.1093/cercor/bhu277
Yamins, D.L.K., DiCarlo, J.J.: Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365 (2016). https://doi.org/10.1038/nn.4244
Yamins, D.L.K., Hong, H., Cadieu, C.F., Solomon, E.A., Seibert, D., DiCarlo, J.J.: Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS. 111, 8619–8624 (2014). https://doi.org/10.1073/pnas.1403112111

Supplementary: figures

Additional Declarations

There is NO Competing Interest.
Supplementary Files RS124.pdf Reporting Summary GA.png SupplementaryfiguresTables.docx
Authors: Yu Zhang (corresponding author; School of Psychology, Shanghai Jiao Tong University; ORCID https://orcid.org/0000-0001-7159-4561), Yuxuan Tu, Zihan Yin, Jing Zhang, Weiyang Shi, Siyang Li, Jingguo Dai, Yongfu Hao, Tianzi Jiang (Brainnetome Center, Institute of Automation, Chinese Academy of Sciences; ORCID https://orcid.org/0000-0001-9531-291X).
Record created: 2025-12-02.
08:25:12","extension":"png","order_by":48,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":162266,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/766f44078e977e9083836642.png"},{"id":100025052,"identity":"66aabcbb-d195-4a6f-b3cf-baf39c947f38","added_by":"auto","created_at":"2026-01-12 08:25:16","extension":"xml","order_by":49,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":171759,"visible":true,"origin":"","legend":"","description":"","filename":"COMMSBIO25117020structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/6ceda5d8f5288abcc48e1724.xml"},{"id":100024949,"identity":"ff649bc0-8060-4acf-b269-54a0f48f79a5","added_by":"auto","created_at":"2026-01-12 08:25:11","extension":"html","order_by":50,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":193836,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/cc73d3df3b987c8e92b0b8f9.html"},{"id":100025033,"identity":"f6df1097-9c1d-4681-94af-85d141a298a1","added_by":"auto","created_at":"2026-01-12 08:25:16","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":709589,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eMultimodal brain encoding reveals the dominant role of language-derived representations fornatural scene understanding\u003c/strong\u003e. \u003cstrong\u003e(A)\u003c/strong\u003e Joint image-text embedding using ViLT significantly predict 7T fMRI responses to natural scenes. \u003cstrong\u003e(B)\u003c/strong\u003e Mapping the prediction accuracy of encoding models (Pearson \u003cem\u003er\u003c/em\u003e) onto large-scale brain networks. 
Compared to unimodal vision models (ResNet, ViT), language (BERT) and multimodal (ViLT) models showed superior performance across widely distributed brain networks, particularly within the frontoparietal and attention networks. The inter-subject variability map of language and multimodal models exhibited a highly similar distribution, which was distinct from the vision models. \u003cstrong\u003e(C) \u003c/strong\u003eVertex-wise encoding maps of CNN-based (ResNet) and transformer-based (ViT) vision models, language models (BERT), and multimodal models (ViLT). Vision-derived features from ViT and ResNet significantly predict neural activity primarily within the visual cortex. Language-derived representations from BERT yield widespread significant predictions (FDR corrected, p\u0026lt;0.01) that extend into frontoparietal, dorsal-attention and default-mode networks. Joint image-text embeddings from ViLT demonstrate the highest encoding performance across both the visual cortex and other associated brain networks, highlighting the advantages of multimodal fusion for combining both visual and linguistic information.\u003c/p\u003e","description":"","filename":"Figure1jointembeddingvilt.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/08747f5fa2b34747da647734.png"},{"id":100024998,"identity":"6ef0edb6-3123-4e0b-b949-c11ac9148a52","added_by":"auto","created_at":"2026-01-12 08:25:14","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":939660,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHierarchical and architectural specialization in visual-based encoding models.\u003c/strong\u003e \u003cstrong\u003e(A)\u003c/strong\u003eVertex-wise encoding performance (Pearson \u003cem\u003er\u003c/em\u003e) for the four residual blocks of ResNet-50 and the last layer of ViT-B/16, plotted on the fsaverage-7 surface. 
\u003cstrong\u003e(B) \u003c/strong\u003eMapping encoding performance of ResNet-50 residual blocks and ViT onto individual visual areas, defined by the Kastner atlas (Wang et al., 2015a). Our results resembled the cortical hierarchy of visual processing that low-level CNN layers encoded activity in V1-V3 (edges and contour), mid-level CNN layers encoded V3A/B and VO1/2 (object shape and boundaries), deepest ResNet blocks encoded hMT+(motion and direction), while ViT encoded both low-level features in V1-V3, motion information in the dorsal “where” stream, and object categoriesin the ventral “what” stream. \u003cstrong\u003e(C)\u003c/strong\u003eMapping encoding performance onto the Yeo-7 functional networks. The ViT-based encoding model significantlyoutperforms ResNet-50 in the visual (VIS), dorsal attention (DAT), and ventral attention (VAT) networks. Network abbreviations: VIS,visual system; DAT, dorsal attention network; VAT,ventral attention network; FPN, frontoparietal network; DMN,default mode network; LIM, limbic system; SOM,sensorimotor network.\u003c/p\u003e","description":"","filename":"Figure2visionmodelvitnew.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/28dcadb9b209691404afd7a6.png"},{"id":100025058,"identity":"2f41ba39-949a-4fb3-9de7-c828fced7bcb","added_by":"auto","created_at":"2026-01-12 08:25:17","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":431820,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eBrain encoding using semantic embeddings derived from BERT. (A)\u003c/strong\u003eCortical map of language-based encoding models by implementing BERT on the 5-sentence text captions provided by the COCO dataset. \u003cstrong\u003e(B) \u003c/strong\u003eEncoding performance of BERT using varying caption lengths. Violin plots show the distribution of prediction accuracy (\u003cem\u003er\u003c/em\u003e) for the top 1% best-predicted cortical vertices. 
Semantic embeddings of natural scenes were extracted by applying the BERT model to text captions of varying lengths (short, medium, long or full 5-sentence captions), and then used to predict fMRI responses. The monotonic increase of encoding accuracy as the caption length increases demonstrates that richer linguistic context enhances the semantic prediction of brain activity, even in the absence of visual information.\u003c/p\u003e","description":"","filename":"Figure3languagemodelbert.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/473755dba7a1d5a9c37c013b.png"},{"id":100025065,"identity":"f9c762ad-c3c7-4e50-8b9a-2f811e937ba7","added_by":"auto","created_at":"2026-01-12 08:25:20","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":861613,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCortical maps of visual-based, language-based, and multimodal encoding models\u003c/strong\u003e. \u003cstrong\u003e(A) \u003c/strong\u003eSpatial distribution of encoding performance for unimodal and multimodal encoding models, including CNN-based (ResNet) and Transformer-based (ViT) vision models, language models (MultiHot, BERT) and the joint image-text embedding model (ViLT). \u003cstrong\u003e(B)\u003c/strong\u003e Distribution of vertex-wise encoding performance across all model layers and types. \u003cstrong\u003e(C)\u003c/strong\u003e Encoding performance for the top 5% best-predicted cortical vertices. Among these, the joint image-text embedding (ViLT) achieved the best encoding performance, followed by language-based models (MultiHot, BERT). All types of vision models (including ResNet blocks and ViT) demonstrated relatively low encoding performance. \u003cstrong\u003e(D)\u003c/strong\u003e Mapping encoding performance onto the Yeo-7 functional networks. 
The multimodal encoding model (ViLT) exhibited the highest encoding accuracy in visual, frontoparietal, dorsal- and ventral-attention networks. In contrast, language-based models (MultiHot, BERT) outperformed vision-based models in frontoparietal, attention and default-mode networks. Vision-based models (ViT) demonstrated superior performance only in the early visual areas (V1-V2). Even in these visual areas, ViLT still achieved the best overall predictions.\u003c/p\u003e","description":"","filename":"Figure4modelcomparison.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/5e3a658eee96d2e2a09265cb.png"},{"id":100362901,"identity":"ee342e26-d505-4936-a745-1646a290eb6e","added_by":"auto","created_at":"2026-01-16 07:48:15","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":459746,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eJoint embedding of visual and linguistic features enhances cortical representations of natural scenes\u003c/strong\u003e. \u003cstrong\u003e(A-B) \u003c/strong\u003eVertex-wise difference maps illustrating the performance gain of the multimodal ViLT model over unimodal models (ViLT - BERT and ViLT - ViT), plotted on the fsaverage surface. Warm colors denote vertices where the joint embedding model significantly outperformed unimodal models. ViLT demonstrated a widespread advantage over unimodal encoding models, outperforming BERT in high-order visual cortex and surpassing ViT throughout dorsal and ventral streams, as well as in associated areas of the frontal, parietal and temporal lobes. Only a small, localized performance deficit was observed in early visual cortex (V1-V3) relative to ViT. \u003cstrong\u003e(C) \u003c/strong\u003eMapping encoding performance (\u003cem\u003er\u003c/em\u003e) onto individual visual areas defined by the Kastner atlas. 
Encoding performance was quantified for early (V1-V3), intermediate (V3A/B, LO1/2, VO1/2) and high-level (IPS, hMT+, FEF) visual areas. Among these models, the multimodal (ViLT) encoding model showed the highest encoding accuracy compared with unimodal vision (ViT) and language (MultiHot, BERT) models. \u003cstrong\u003e(D-E) \u003c/strong\u003eHemispheric lateralization of vertex-wise encoding accuracy, focusing on the top 5% best-predicted cortical vertices. Vision-based (ResNet, ViT) models showed a bilaterally distributed encoding performance, suggesting roughly equal contribution of both hemispheres to visual perception. In contrast, language-based (MultiHot, BERT) and multimodal (ViLT) models demonstrated a significant left-hemisphere lateralization effect in the encoding performance, consistent with the typical dominance of the left hemisphere for language and semantic processing. Furthermore, the language-based (BERT) model maintained a reliable pattern of left-hemisphere lateralization across all tested caption lengths. This consistent lateralization pattern was also observed in other participants, reinforcing the language-driven nature of scene understanding.\u003c/p\u003e","description":"","filename":"Figure5languagederivedlateralization.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/e069fa642fa7b29fea631800.png"},{"id":100024970,"identity":"5bbfd504-c1ff-45c9-9bbe-752f3c45e9bd","added_by":"auto","created_at":"2026-01-12 08:25:12","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":588860,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eLanguage, not vision, primarily accounts for individual differences in natural-scene understanding.\u003c/strong\u003e \u003cstrong\u003e(A-B) \u003c/strong\u003eInter-subject variability maps of vertex-wise encoding models across participants for unimodal vision-based (ViT), language-based (BERT) and multimodal (ViLT) models. 
The BERT (using text captions) and ViLT (using both image features and text captions) models exhibited high variability in anterior temporal, inferior-frontal and parietal cortices. These regions are associated with high-level cognitive and semantic processing. In contrast, ViT (using image features only) showed relatively low overall variability, which was largely confined to the early visual cortex and lateral occipital areas. \u003cstrong\u003e(C) \u003c/strong\u003eCorrelation analysis of vertex-wise variability maps. The inter-subject variability of the ViLT model was closely predicted by that of BERT (r = 0.97), indicating a shared source of individual differences between the two models, but showed a moderate correlation with that of ViT (r = 0.74). Our results suggest that individual differences in scene comprehension are dominated by language-mediated semantic processing (as captured by BERT and ViLT), rather than by visual perception (as captured by ViT).\u003c/p\u003e","description":"","filename":"Figure6varabilitymap.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/9394e5b8feaee4d866ca6683.png"},{"id":100025032,"identity":"2dc6e152-2811-4732-8bf5-9466e1e8b509","added_by":"auto","created_at":"2026-01-12 08:25:16","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":608658,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eInter-subject consistency and variability of the ViLT multimodal encoding model. \u003c/strong\u003eThe vertex-wise encoding accuracy of the ViLT model demonstrated consistent and reliable predictions across all four participants, particularly in high-order visual, temporo-occipital, and fronto-parietal cortices. This suggests a population-wise stable, multimodal cortical signature for scene understanding. Meanwhile, the encoding models showed notable individual variability in the lateral temporal, occipital, and frontal regions. 
This variability supports the hypothesis that a language-mediated process facilitates the integration of higher-order contextual information with long-term semantic knowledge in natural-scene understanding.\u003c/p\u003e","description":"","filename":"Figure7individualmodelvilt.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/a67bcd658e173c30007ffcb9.png"},{"id":100024965,"identity":"0840b941-b3d9-45e3-a922-dd15ad873203","added_by":"auto","created_at":"2026-01-12 08:25:12","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":1134109,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCortical semantic atlas of objects and concepts in natural scenes.\u003c/strong\u003e \u003cstrong\u003e(A) \u003c/strong\u003eUnified color scheme for objects and concepts based on the first three PCs of the group-level multi-hot encoding model. Blue: PC1 indicates the distinction of animate and inanimate subjects. Green: PC2 encodes action-related and movable objects. Purple: PC3 represents person, man-made and civilization-related concepts. \u003cstrong\u003e(B) \u003c/strong\u003eCortical semantic atlas for an example subject (Subject 01). The color scheme of each vertex represents the loading of three semantic components. \u003cstrong\u003e(C) \u003c/strong\u003eCortical semantic atlas for three additional subjects (Subjects 02, 05, and 07). These cortical maps demonstrate highly consistent semantic mapping across individuals, with animate subjects represented in early visual cortex (blue), man-made and civilization-related concepts encoded in the ventral what stream (purple), action-related and movable objects encoded in the dorsal where stream (green). Only significant predictions (FDR corrected, p\u0026lt;0.001) were plotted on the map. \u003cstrong\u003e(D) \u003c/strong\u003eProjection of the COCO “thing” categories onto the first three principal components (PC1 vs PC2 vs PC3). 
The font size and color scheme of each item indicate the loadings of objects in the corresponding semantic component, with red colors indicating positive loadings and blue indicating negative loadings. PC1 distinguishes animate from inanimate items (e.g., animals and person), whereas PC2 captures movable objects and transportation, PC3 indexes civilization-related concepts.\u003c/p\u003e","description":"","filename":"Figure8multihotsemanticmap.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/ec1ee3019140a6ba98a135c7.png"},{"id":100406336,"identity":"7bb4a974-df11-4678-8936-3dd081270219","added_by":"auto","created_at":"2026-01-16 13:00:39","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":7267909,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/2faa2ff0-ec12-4a85-838b-ead3c0f0f897.pdf"},{"id":100025053,"identity":"91e2045f-dd0e-4930-b84d-2384070186bc","added_by":"auto","created_at":"2026-01-12 08:25:16","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":83886,"visible":true,"origin":"","legend":"Reporting Summary","description":"","filename":"RS124.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/c573e1c14216d395f5d4b3ae.pdf"},{"id":100024948,"identity":"58c11e02-f239-4b70-9dde-1f6ec91e3c6c","added_by":"auto","created_at":"2026-01-12 08:25:11","extension":"png","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":879021,"visible":true,"origin":"","legend":"","description":"","filename":"GA.png","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/13c2cd830ec5a7c7ca74f800.png"},{"id":100024973,"identity":"3a552a68-89cb-49cd-867a-fc6667c792f8","added_by":"auto","created_at":"2026-01-12 
08:25:12","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":3677655,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryfiguresTables.docx","url":"https://assets-eu.researchsquare.com/files/rs-8259624/v1/56b7246ad6ba0654ea88f8ae.docx"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"The Semantic Scaffold: Functional Dissociation of Visual and Language-derived Features Shapes Human Natural Scene Understanding","fulltext":[{"header":"Highlights","content":"\u003cul start=\"50\"\u003e\n \u003cli\u003eMultimodal AI models reveal a clear functional dissociation between sensory-driven visual processing and language-derived semantic pathway.\u003c/li\u003e\n \u003cli\u003eThis establishes a distinct semantic encoding system in frontal and temporal association cortices, distinct from the feed-forward, bottom-up visual processing.\u003c/li\u003e\n \u003cli\u003eMultimodal integration of visual and semantic features yields widespread superior predictions relative to their unimodal counterparts, validating the integration component of the Semantic Scaffold framework.\u003c/li\u003e\n \u003cli\u003eThis semantic system is organized as a unified atlas, characterized by a dominant animate-vs-inanimate gradient and a robust left-hemisphere lateralization.\u003c/li\u003e\n\u003c/ul\u003e"},{"header":"Significance Statement","content":"\u003cp\u003eHow the human brain understands a complex visual scene is traditionally modeled as a feed-forward, visual-centric process. This work challenges this traditional view, proposing a Semantic Scaffold framework where language-derived knowledge is a foundational component in natural-scene understanding that actively shapes visual perception. 
Using high-resolution 7T fMRI and advanced computational models, we establish two core components of this framework: 1) a functional dissociation between the visual pathways and a distinct semantic pathway in the frontal and temporal lobes, and 2) an integration process whereby these pathways converge to form a unified perception. This work repositions language-derived knowledge as a primary, active component in how humans build a coherent perception of the world.\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e"},{"header":"Introduction","content":"\u003cp\u003eDeciphering how the human brain constructs a coherent interpretation of the visual world remains a fundamental challenge in neuroscience, requiring the flexible association of high-resolution sensory input with abstract, conceptual knowledge. Classically, the neural basis of object recognition and scene understanding has been modeled as a hierarchical, feed-forward process, extending from early sensory areas (V1) through the ventral visual stream to high-level temporal cortices (IT/VTC) (DiCarlo et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; Yamins and DiCarlo, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2016\u003c/span\u003e). In this view, semantic interpretation is commonly regarded as an emergent property derived primarily from the bottom-up processing of visual features. 
This visual-centric paradigm has been reinforced by recent advances in neuro-AI, where deep neural networks trained solely on visual tasks (e.g., AlexNet, ResNet) have proven remarkably effective at predicting neural activity across the primate and human visual system (Bonnen et al., \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; G\u0026uuml;\u0026ccedil;l\u0026uuml; and Gerven, 2015; Horikawa and Kamitani, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Schrimpf et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Yamins et al., \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2014\u003c/span\u003e). These vision-based models, however, excel primarily at explaining neural dynamics related to bottom-up sensory processing while often neglecting the profound influence of top-down factors and semantic context, which are central to theories of predictive processing (Millidge et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Salvatori et al., \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2021\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAccumulating evidence suggests that real-world scene comprehension is rarely a purely bottom-up visual exercise, but instead relies heavily on the top-down semantic grounding that employs abstract conceptual knowledge to resolve perceptual ambiguity and structure our understanding of the environment (Bi, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Lupyan et al., \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). This reliance on conceptual knowledge implies that the neural architecture of scene understanding involves the interplay and convergence of sensory-derived representations and abstract, language-derived knowledge. 
Prior work has successfully mapped large-scale semantic spaces using narrative speech or isolated linguistic stimuli (Huth et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2016\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; Mitchell et al., \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2008\u003c/span\u003e; Popham et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). More recently, several studies have demonstrated that multimodal deep learning models (combining vision and text) predict brain responses more accurately than unimodal models (Bonner and Epstein, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Doerig et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). While these studies have established the computational utility of multimodal models, the biological architecture underlying this improvement remains unresolved. It remains unclear whether language-derived semantic information is simply integrated into the visual stream itself, or if it reflects the recruitment of a distinct, anatomically dissociable pathway that creates a conceptual scaffold for vision. Furthermore, the precise functional topography of where perceptually-driven visual features dissociate from abstract semantic features has not been definitively mapped during naturalistic processing. 
To address this critical gap, we propose the Semantic Scaffold framework, positing that human scene understanding relies on two core components: 1) a functional dissociation between two representational streams, consisting of a bottom-up visual pathway and a top-down, language-derived pathway that provides semantic context; and 2) an integration process whereby the language-derived pathway provides a contextual scaffold that actively shapes coherent perception.\u003c/p\u003e \u003cp\u003eTo test this hypothesis, we leveraged a suite of unimodal (visual-only, language-only) and multimodal (visual-language) deep learning architectures as computational probes against the massive 7T fMRI Natural Scenes Dataset (NSD; Allen et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) to systematically dissect the unique cortical contributions of visual- versus language-derived representations. Our analysis reveals three key findings that advance the mechanistic understanding of natural-scene processing. First, we identify a functional dissociation between the two processing streams. While perceptually-driven visual features are strictly confined to the visual cortex, language-derived semantic features robustly predict activity across expansive frontal and temporal association cortices, independent of visual complexity. Second, we demonstrate that multimodal integration is critical for modeling activity at the interface of the two systems, providing empirical support for a multimodal mechanism where top-down semantic knowledge contextually modulates visual input. Finally, we resolve the internal structure of this semantic scaffold, revealing a unified atlas organized around a dominant animate-inanimate axis with robust left-hemisphere lateralization. 
Collectively, these findings reposition language-derived knowledge from a secondary consequence of vision to a foundational, active scaffold that shapes human experience of the visual world.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eVisual encoding models map the cortical hierarchy of the visual system\u003c/h2\u003e \u003cp\u003eWe first validated our encoding framework by demonstrating its ability to recapitulate the well-established cortical hierarchy of the human visual system. As expected, vision-based encoding models using convolutional neural networks (CNN; e.g., ResNet-50) and vision transformer (ViT) successfully predicted neural activity across the entire visual cortex. In addition, we observed a clear hierarchical progression in CNN-based encoding maps: shallower layers best predicted activity in early visual areas (V1-V3), while deeper layers extended to higher-order regions in the ventral (\u0026ldquo;what\u0026rdquo;) and dorsal (\u0026ldquo;where\u0026rdquo;) pathways. Using latent features from the four residual blocks of ResNet-50, we found a clear functional specialization progressing along the ventral and dorsal pathways (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). Specifically, the earliest layer (Block 1) selectively predicted activity in early retinotopic areas (V1-V3), reflecting the encoding of low-level features like edges and contrast. The intermediate layers (Blocks 2 and 3) extended the predictive power to higher-order regions in both ventral (e.g., LOC, pFs) and dorsal (e.g., V3A/B, IPS) streams, indicating increasing selectivity for intermediate shape and spatial geometry. Notably, Block 3 additionally engaged scene- and motion-selective periphery (PPA, OPA, hMT+), consistent with its role in developing abstract, position-tolerant representations for recognizing complex patterns and moving objects. 
The deepest layer (Block 4) showed minimal correspondence with early visual cortex, instead achieving significant predictions in higher-order regions dedicated to visuospatial and motion processing (e.g., hMT+, posterior IPS), representing global attributes of the scene such as spatial location and dynamic motion. This functional specialization confirmed that CNN-based encoding models mirror the cortical hierarchy of visual processing, progressing from low-level features and shape geometry to abstract representations of complex patterns and object interaction.\u003c/p\u003e \u003cp\u003eA direct comparison of CNN-based and transformer-based architectures revealed complementary, anatomically specific advantages (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB). Compared to ResNet, ViT yielded significantly higher prediction accuracy in early visual cortex (V1-V3) and along the ventral \u0026ldquo;what\u0026rdquo; pathway (e.g., LOC, FFA, PPA), suggesting that ViT\u0026rsquo;s self-attention mechanisms effectively captured the fine-grained, holistic details required for robust object recognition. In contrast, ResNet excelled along the dorsal \u0026ldquo;where\u0026rdquo; pathway (e.g., V3A/B, IPS, hMT+), consistent with the established benefit of using translation-equivariant kernels for processing information related to spatial location and motion. 
Furthermore, by mapping these vertex-wise prediction maps onto individual visual areas defined by the Kastner atlas (Wang et al., \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2015a\u003c/span\u003e), we confirmed a strict hierarchical correspondence among vision models: low-level CNN layers predicted activity in V1-V3 (edges and contours), intermediate layers predicted V3A/B and VO1/2 (object shape and boundaries), and the deepest layers predicted hMT+ (motion and direction), while ViT encoded both low-level details in V1-V3 and abstract representations in the ventral pathway (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB). When mapping these predictions onto large-scale functional networks (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC), we found that ViT\u0026rsquo;s advantage extended beyond the visual cortex and into the dorsal and ventral attention networks. This finding underscores the benefit of a global self-attention architecture for modeling both sensory processing and associated attention-related cortical responses.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eLanguage models reveal a semantic system beyond the visual cortex\u003c/h3\u003e\n\u003cp\u003eWe hypothesized that scene representations derived from linguistic and semantic embeddings would encode neural activity extending beyond the boundaries of the classical visual cortex. Using an 80-dimensional multi-hot vector of COCO \u0026ldquo;thing\u0026rdquo; labels, we observed significant predictions (FDR corrected, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01) across extensive frontal, parietal and temporal association regions (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). 
Critically, this simple multi-hot label vector surpassed the best vision-based models (ViT) in prediction accuracy across widely distributed functional areas, including high-order visual areas, frontoparietal, dorsal-attention and default-mode networks (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e4\u003c/span\u003eC). Substituting the multi-hot labels with semantic embeddings extracted from BERT (text only, using five-sentence captions of each scene provided by the COCO dataset) yielded a similar, broadly distributed cortical map (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). While these two types of models exhibited comparable encoding accuracy in high-order visual cortex, BERT achieved higher peak correlations in anterior temporal and inferior frontal regions compared to MultiHot, suggesting the extraction of superior abstract semantic features in BERT that are better suited for comprehensive scene understanding. Intriguingly, despite its lower complexity and fewer parameters, the MultiHot encoding model proved highly competitive by achieving a median prediction accuracy (\u003cem\u003er\u003c/em\u003e) only 3% less than BERT and produced only 1.7% fewer significant predictions (FDR corrected, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01). Furthermore, both types of language-based encoding models exhibited similar predictive profiles along the dorsal and ventral streams (Figure S2). These results collectively demonstrate that language-derived representations robustly encode neural responses in the association cortex, consistently outperforming purely visual representations and extending their predictive power beyond higher-order visual areas into frontal, parietal, and temporal regions (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). 
This finding establishes the existence of a widespread cortical semantic system dedicated to representing abstract semantic knowledge for scene understanding, which is functionally distinct from visual appearance and extends far beyond the traditional boundaries of the visual cortex.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe next investigated the effect of richer linguistic context on the encoding performance by applying the BERT model to text captions of varying lengths, i.e., the shortest (8.51\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\:\\pm\\:\\:\\)\u003c/span\u003e\u003c/span\u003e0.81 words), medium-length (10.12\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\:\\pm\\:\\:\\)\u003c/span\u003e\u003c/span\u003e1.21 words) and longest (13.19\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\:\\pm\\:\\:\\)\u003c/span\u003e\u003c/span\u003e3.11 words) captions, as well as the full 5-sentence captions. The resulting semantic embeddings were used, separately for each caption length, to predict vertex-wise fMRI responses to each scene. The overall whole-brain encoding maps remained highly consistent in their spatial distributions across different caption lengths (Figures S4 and S5), extending broadly from the visual cortex to the frontal, parietal and temporal lobes. 
However, a detailed analysis of the top 1% best-predicted cortical vertices (approximately 700 vertices per hemisphere) revealed a clear dependency on linguistic context: both median and maximum prediction accuracy (\u003cem\u003er\u003c/em\u003e values) increased monotonically with caption length (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). In particular, BERT with the full 5-sentence captions achieved the highest encoding performance among language-based models, reaching a maximum encoding \u003cem\u003er\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.714 (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). The most accurately predicted areas included those engaged in the recognition of motion (e.g., hMT/MST), faces (e.g., FFA) and scenes (e.g., PPA, RSC). This finding demonstrates that richer linguistic context and additional semantic details significantly enhance the prediction of brain activity in association areas, even in the absence of visual information.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eOur analysis revealed a significant hemispheric lateralization effect among language-based encoding models. Specifically, both MultiHot and BERT-based encoding models exhibited a clear left-hemisphere dominance, with median \u003cem\u003er\u003c/em\u003e values 0.06\u0026ndash;0.08 higher in the left hemisphere than in the right (Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e5\u003c/span\u003eD). This pattern was consistently observed across all caption lengths (Figure S5). In contrast, vision-based encoding models (ResNet and ViT) exhibited a bilaterally distributed pattern, suggesting roughly equal contributions from both hemispheres to visual perception. 
Among all encoding models, BERT exhibited the strongest left-hemisphere dominance, with a median-correlation laterality index (LI) of 0.073, followed by the multimodal ViLT model (0.067) and MultiHot (0.060), while vision-based models displayed weak rightward or negligible lateralization (ViT: LI = -0.04; ResNet: LI = -0.002). The left-hemisphere lateralization of language models was consistently observed across participants, caption lengths (single sentence vs. five-sentence) and text formats (object-label vectors vs. full-sentence captions) (Figure S5).\u003c/p\u003e \u003cp\u003eThese findings align with the typical dominance of the left hemisphere for language and semantic processing, contrasting with the bilateral distribution for sensory processing. This suggests that semantic embeddings play a crucial role in the reconstruction of rich linguistic and contextual information for comprehensive natural-scene understanding.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003ch3\u003eMultimodal fusion outperforms unimodal models across the brain\u003c/h3\u003e\n\u003cp\u003eVision and language models each captured unique components of neural activity during natural-scene viewing. Visual features extracted from CNN and ViT dominated early visual areas (V1-V3), while semantic embeddings from MultiHot and BERT preferentially explained activity in associated areas of the frontoparietal and dorsal/ventral attention networks. We therefore hypothesized that the joint embedding of visual and linguistic information would amplify these complementary signals and yield the best prediction of neural activity across the whole brain.\u003c/p\u003e \u003cp\u003eTo test this, we implemented a multimodal encoding model based on ViLT, which was pretrained to align image patches, object labels and text captions. 
As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e4\u003c/span\u003eA and \u003cb\u003eTable S1\u003c/b\u003e, the ViLT-based encoding model significantly outperformed every unimodal model, with its median prediction accuracy rising by 3% over BERT, 8% over MultiHot, 28% over ViT, and 30% over ResNet. Crucially, the gain relative to vision-based models was markedly stronger in the left hemisphere (ViLT - ViT: 33% left vs 23% right; ViLT - ResNet: 34% left vs 26% right), whereas the gain relative to language models was comparable across hemispheres (ViLT - BERT: 2.7% left vs 3.4% right; ViLT - MultiHot: 8.4% left vs 7.1% right). The significantly predicted cortical vertices (FDR corrected, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01) extended from the early visual cortex along both dorsal and ventral pathways into perisylvian language areas and frontoparietal control networks (e.g., LOC, FFA, PPA, PHC, hMT+, IPS, SPS). This finding indicates that the cross-modal interaction of visual and linguistic information boosts brain encoding beyond a simple additive benefit of vision plus language. For instance, ViLT outperformed BERT in higher-order visual cortex (V3A/B, hMT+) and surpassed ViT throughout dorsal and ventral streams as well as in frontoparietal and attention networks (Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e5\u003c/span\u003eA,B). The best-predicted sites were found in the lateral temporal cortex (TO) and parahippocampal cortex (PHC), highlighting the behavioral relevance of multimodal representations for object recognition and scene understanding.\u003c/p\u003e \u003cp\u003eDifference maps of multimodal and unimodal encoding models (Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e5\u003c/span\u003eA) further dissected the unique contribution of each modality. 
The visual component (e.g., ViLT - BERT) revealed positive clusters in hMT, MST and OFA, indicating the enhancement of low-level visual details beyond abstract representations derived from language alone. Conversely, the semantic component (e.g., ViLT - ViT) demonstrated widespread gains in frontal and parietal regions, reflecting a global amplification of semantic information in neural encoding over purely visual, low-level image features. Regional encoding performance in individual visual areas, defined using the Kastner atlas, confirmed these patterns (Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e5\u003c/span\u003eB). While ViT\u0026rsquo;s dominance was confined to early visual cortex (V1v, V1d and V2v), language-based models (BERT and MultiHot) dominated the ventral and dorsal visual pathways. Critically, the ViLT model aggregated these advantages by matching ViT in early visual cortex while outperforming all other models elsewhere.\u003c/p\u003e \u003cp\u003eMapping these results onto the Yeo-7 functional networks revealed a clear functional gradient in encoding accuracy (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e4\u003c/span\u003eC), which decayed across brain networks from visual to dorsal-attention, frontoparietal, and default-mode networks. Crucially, within each association network, the language-based and multimodal models systematically outperformed vision-only models. This demonstrates that natural-scene understanding consistently recruits abstract, language-like representations throughout the cortex that supersede purely visual coding. Finally, the ViLT model also exhibited a left-hemisphere lateralization effect (LI of median correlation: 0.067), similar to the language-based models (BERT: 0.074; MultiHot: 0.060), in contrast to the bilaterally distributed patterns in vision-based models (Figure S5). 
The combined evidence underscores the essential role of incorporating both semantic and visual features for an accurate account of human scene understanding.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003ch3\u003eInter-subject variability is driven by language, not vision\u003c/h3\u003e\n\u003cp\u003eTo quantify how scene-comprehension strategies differ across individuals, we trained vertex-wise encoding models separately for each participant and subsequently computed the inter-subject variance of prediction accuracy. The joint image-text embeddings in ViLT consistently yielded high encoding accuracy in the lateral occipital and temporal cortex, as well as in frontoparietal regions (Fig.\u0026nbsp;\u003cspan refid=\"Fig12\" class=\"InternalRef\"\u003e7\u003c/span\u003e). Crucially, the variability maps revealed a clear spatial dissociation: inter-subject variance was highest in the frontoparietal network and ventral visual pathway (e.g., V4, hMT, MST, FFA, PPA, IPS), while early visual areas (V1-V3) showed relatively low variability. This topography was prominent for both the multimodal (ViLT) and language-based (BERT) models, but was notably absent for the vision-based (ViT) model, which exhibited low variability across the entire cortex, with minor peaks in visual areas and the ventral pathway (Fig.\u0026nbsp;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e6\u003c/span\u003eA).\u003c/p\u003e \u003cp\u003eFurther correlation analysis of variability maps demonstrated that the inter-subject variability of multimodal encoding (ViLT) was almost perfectly predicted by unimodal language-based models (ViLT vs BERT: r\u0026thinsp;=\u0026thinsp;0.97), but only moderately correlated with vision-only models (ViLT vs ViT: r\u0026thinsp;=\u0026thinsp;0.74). 
Scatter plots of vertex-wise variance maps confirmed this visual-linguistic dissociation (Fig.\u0026nbsp;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e6\u003c/span\u003eB): vertices of BERT-vs-ViLT clustered tightly along the identity line, whereas vertices of ViLT-vs-ViT and ViT-vs-BERT were more broadly distributed, exhibiting a zero-shift at high-variance vertices, especially for vertices showing low accuracy in BERT but relatively high accuracy in ViT. These findings suggest that inter-subject differences in natural-scene understanding are primarily driven by variability in language-mediated, high-level semantic processing, rather than by differences in low-level visual perception.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003ch3\u003eSemantic cortical atlas of objects and concepts\u003c/h3\u003e\n\u003cp\u003eThe semantic encoding models not only demonstrated high predictive power across the entire cortex but also offered strong interpretability and behavioral relevance through semantic mapping. To create a unified semantic cortical atlas of objects and concepts, we trained a single MultiHot encoding model using group-concatenated fMRI data from all participants. The resulting group-level weight matrix (80 COCO \u0026ldquo;thing\u0026rdquo; categories by 163k cortical vertices) was then projected onto low-dimensional semantic axes using principal component analysis (PCA). The first three principal components (PCs), which collectively explained 60% of the total variance, served as a unified color palette for the semantic mapping (Fig.\u0026nbsp;\u003cspan refid=\"Fig13\" class=\"InternalRef\"\u003e8\u003c/span\u003eA). 
By projecting individual participants\u0026rsquo; cortical vertices onto these semantic axes, we generated a continuous semantic map for each brain (Fig.\u0026nbsp;\u003cspan refid=\"Fig13\" class=\"InternalRef\"\u003e8\u003c/span\u003eB,C), with each PC capturing a distinct semantic gradient. Specifically, PC1 (regions in blue) strongly correlated with the animate-vs-inanimate distinction and clustered within the early visual cortex (V1-V3). PC2 (regions in green) encoded action-related and movable objects, primarily located in areas of the dorsal pathway including the intraparietal sulcus (IPS), superior parietal lobule (SPL) and hippocampus. PC3 (regions in purple) indexed concepts such as person, food, and man-made tools, occupying areas in the ventral pathway including FFA and OFA, and extending into prefrontal cortex. This demonstrates that the low-dimensional semantic embedding of objects and concepts successfully recovered behaviorally meaningful, semantic gradients that respect the functional anatomy of the human visual system.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn this study, we leveraged a suite of unimodal (visual-only, language-only) and multimodal (visual-language) deep learning architectures as computational probes to test the Semantic Scaffold hypothesis for human natural-scene understanding. Our principal findings demonstrate that comprehensive scene comprehension critically relies on abstract, language-derived semantic knowledge that extends far beyond the classical visual cortex, challenging the traditional view of semantic processing as a purely downstream visual consequence. We provide strong evidence for the two core components of our framework. First, supporting the dissociation component, while perceptually-driven representations remained largely confined to the visual cortex, language-derived representations robustly predicted activity across expansive association networks. 
Second, supporting the integration process, we found that multimodal joint embeddings systematically outperformed visual-only models across expansive frontal and temporal association cortices. This superior performance underscores the dynamic interplay between bottom-up sensory processing and top-down semantic grounding, suggesting that visual input is continuously interpreted through existing semantic knowledge structures. Further gradient analysis revealed a unified semantic atlas organized along the animate-inanimate axis, and a reliable left-hemisphere lateralization for high-level semantic integration. Collectively, our study repositions language-derived semantic knowledge as a primary, foundational component organizing cortical representations, advancing a computational roadmap for disentangling the roles of vision and language in shaping the functional topography of the human brain.\u003c/p\u003e\n\u003ch3\u003eSemantic scaffolding: dissociation of visual and abstract semantic features\u003c/h3\u003e\n\u003cp\u003eOur findings provide strong support for the dissociation component of our Semantic Scaffold hypothesis, suggesting a clear functional separation between sensory- and language-derived features. First, we established a baseline by confirming that visual-only encoding models (e.g., ResNet, ViT) consistently replicate the foundational neuro-AI work (Bonnen et al., \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; G\u0026uuml;\u0026ccedil;l\u0026uuml; and Gerven, 2015; Horikawa and Kamitani, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Schrimpf et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Yamins et al., \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2014\u003c/span\u003e). 
In line with these classic studies, we found that perceptually-driven representations remained largely confined to the visual cortex, supporting the well-established cortical hierarchy for visual processing (V1 to VTC). Neuroscientists initially employed brain encoding models to determine which AI model is most \u0026ldquo;brain-like\u0026rdquo; (Schrimpf et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Following foundational work (Schrimpf et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Yamins and DiCarlo, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2016\u003c/span\u003e), studies employed vision models like LeNet, AlexNet, and VGGNet to accurately predict fMRI activity and reconstruct visual stimuli in areas V1 through IT (G\u0026uuml;\u0026ccedil;l\u0026uuml; and Gerven, 2015; Horikawa and Kamitani, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Nishimoto et al., \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2011\u003c/span\u003e; Shen et al., \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Our results are consistent with this classic view of the human visual system: the predictive power of visual-only models was concentrated along the cortical hierarchy for low-level visual processing (V1 to V4/LO).\u003c/p\u003e \u003cp\u003eIn sharp contrast, both language-based models (e.g., MultiHot, BERT) and multimodal joint embeddings (ViLT) substantially outperformed visual-only models across extensive frontal and temporal association cortices. 
This dichotomy strongly supports the evolving perspective that scene interpretation relies not solely on sensory perception, but also on abstract, language-derived representations (Bi, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Lupyan et al., \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Our results establish a distinct semantic encoding pathway in frontal and temporal lobes, robustly dissociable from purely visual perception. These association cortices, crucial for conceptual retrieval and knowledge representation, show a profound preference for abstract semantic features. This semantic knowledge structure, effectively captured by both the complex contextual embeddings of BERT and the minimal categorical tags of Multi-Hot vectors, enables these encoding models to maintain their predictive power where purely image-based features lose efficacy. This finding resonates with seminal work demonstrating that concepts are represented in areas far beyond the visual cortex, often overlapping with neural representations of auditory and linguistic stimuli (LeBel et al., \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Nishida and Nishimoto, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Popham et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). Together, our findings establish that this distinct, high-level semantic pathway acts as a crucial cognitive scaffold for sensory perception.\u003c/p\u003e\n\u003ch3\u003eMultimodal integration and top-down semantic guidance\u003c/h3\u003e\n\u003cp\u003eThe superior performance of the multimodal ViLT model relative to its unimodal counterparts provides direct empirical validation for the integration component of our Semantic Scaffold hypothesis. 
This finding confirms that multimodal fusion offers distinct advantages for the accurate mapping of complex brain activity, aligning with advances in AI vision-language transformers like VisualBERT (Li et al., \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), CLIP (Radford et al., \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2021\u003c/span\u003e), and ViLT (Dosovitskiy et al., \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Our results go a step further, suggesting a plausible neural mechanism where top-down semantic knowledge contextually modulates and refines the interpretation of incoming visual information. By fusing visual patches with semantic tokens, ViLT captures the dynamic interplay between bottom-up sensory perception and top-down cognitive influence. The superior performance of this integrated approach suggests that comprehensive scene understanding inherently requires abstract, language-derived knowledge to contextually modulate and refine the interpretation of incoming visual input, in line with Predictive Coding theories (Millidge et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Salvatori et al., \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). This crucial role positions top-down semantic guidance not as merely supplementary, but as a critical and continuous factor in achieving robust unified perception. These findings, therefore, provide empirical support that the brain relies on a similar, integrated computational mechanism, as conceptualized by our scaffold framework.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eThe nature of the Semantic Scaffold: unified semantic atlas and left-hemisphere lateralization\u003c/h2\u003e \u003cp\u003eWe have established the dissociation and integration components of our Semantic Scaffold hypothesis. 
Next, we sought to understand its internal structure. Further analysis of the language-based models confirmed the content and organization of this semantic scaffolding system. Multi-Hot encoding, derived from remarkably simple categorical tags, robustly recovered the principal axes of semantic organization, defining a dominant animate-vs-inanimate axis, complemented by action-related and man-made dimensions. The animate-vs-inanimate axis directly replicates the core structure consistently recovered in prior semantic mapping studies (Huth et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2016\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2012\u003c/span\u003e), where it is widely recognized as the principal dimension of semantic organization. This finding aligns with comparable components observed in semantic maps derived from fMRI responses to natural movies (Huth et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) and the cortical hierarchy of auditory-linguistic atlases (Doerig et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2025\u003c/span\u003e; Popham et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Wang et al., \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Likewise, the action-related dimension aligns with the separation of verbs/actions from nouns/objects, often mapped to the dorsal visual stream involved in motion and manipulation, and the man-made/civilization dimension corresponds to a third major gradient that differentiates tools and manufactured items (Bonner and Epstein, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Mitchell et al., \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2008\u003c/span\u003e). 
The efficacy of the Multi-Hot encoding model confirms that while complex language models like BERT capture fine-grained contextual nuances, the core organizational scaffold of the human semantic network is fundamentally categorical, supporting a robust, unified semantic atlas of the human cortex.\u003c/p\u003e \u003cp\u003eFurthermore, this abstract semantic system exhibits a robust functional asymmetry with notable left-hemisphere lateralization for both the language-based (BERT and MultiHot) and multimodal encoding models. This strong lateralization, consistently observed across participants and invariant to text format and caption length, aligns with the established left-hemisphere dominance for language comprehension and higher-level semantic memory (Fedorenko et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2011\u003c/span\u003e; Malik-Moraleda et al., \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Our findings connect directly to recent work with Large Language Models (LLMs) showing that this asymmetry emerges and strengthens with increasing model complexity (Antonello et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Doerig et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2025\u003c/span\u003e; Grand et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Crucially, this dominance holds even when using abstract Multi-Hot categorical tags instead of continuous semantics, suggesting the left hemisphere is intrinsically biased toward high-level conceptual integration, irrespective of input complexity or sensory modality. This left-hemisphere specialization provides a structural basis for the top-down cognitive influence observed in the multimodal model, explaining how semantic features effectively ground the visual input. 
Therefore, the left-hemisphere lateralization of semantic encoding is not merely a linguistic artifact, but a fundamental reflection of the brain\u0026rsquo;s highly contextual and relational semantic structure.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eLimitations and Future Directions\u003c/h2\u003e \u003cp\u003eThe interpretation of this study is subject to several limitations. First, our findings are based on the intensive, deep-sampling, small-cohort design of the NSD dataset. While this approach is powerful for building robust individual-subject models, the small sample size limits the generalizability of our findings and hinders a comprehensive characterization of inter-subject variability in semantic topography. Future work with larger cohorts is necessary to fully validate the precise spatial organization of these semantic maps across the general population. Second, our study used a static design (a large set of isolated images). This approach was sufficient to establish the existence of the Semantic Scaffold framework, as well as its dissociated pathways (visual vs. language) and their eventual integration (the superior performance of ViLT). However, this may limit the overall power to model the dynamic, contextual integration that occurs in the real world. The next crucial step is to extend this analysis to dynamic, naturalistic stimuli (videos with speech/dialog) and use time-resolved encoding models to test how the multimodal advantage shifts when visual and linguistic content is temporally coordinated and causally linked. This dynamic approach is essential for moving from a static map of the scaffold to a full mechanistic model of multimodal semantic binding.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis work challenges the traditional, visual-centric model of scene understanding. 
We propose the Semantic Scaffold framework, which posits that language-derived knowledge acts as a foundational component for visual perception, not just a downstream consequence. Our results establish the two core components of this framework. The first is a fundamental functional dissociation between perceptually-driven visual representations (confined to the visual cortex) and a distinct, abstract semantic pathway (in frontal and temporal lobes). The second is an integration process, evidenced by the widespread superior performance of multimodal models, demonstrating that these two pathways converge to form a unified, coherent perception. Our findings demonstrate that language-derived semantic knowledge is not a passive, secondary feature, but rather an active scaffold that contextually modulates and refines incoming sensory input. Furthermore, we characterized the nature of this scaffold, revealing a unified semantic atlas organized by a dominant animate-inanimate axis and a robust left-hemisphere lateralization. Collectively, our findings advance an integrated computational mechanism for scene understanding, repositioning language-derived knowledge as a primary component in how humans build a coherent perception of the world.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eParticipants and Dataset\u003c/h2\u003e \u003cp\u003eWe utilized the publicly available Natural Scenes Dataset (NSD) (Allen et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2021\u003c/span\u003e), which comprises whole-brain 7T fMRI scans from eight healthy participants. The NSD paradigm involved up to 40 sessions per participant, with each participant viewing up to 10,000 natural scenes over a one-year period. Among these, a fixed set of 1,000 images was shown to every participant and reserved for model testing. 
The remaining 9,000 images were unique to each participant and used for training subject-specific vertex-wise encoding models. 
During fMRI scanning, each image was presented for 3 seconds, followed by a 1-second inter-stimulus interval. For the present analysis, we included only the four participants (subj01, subj02, subj05, subj07) who completed all 40 sessions. All natural-scene stimuli were drawn from the Common Objects in Context (COCO) dataset (Lin et al., \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2014\u003c/span\u003e), a standard benchmark for object detection, instance/semantic segmentation, and key-point estimation. In addition to the scene images, we used the corresponding COCO metadata, specifically the 80 \u0026ldquo;thing\u0026rdquo; categories (common objects like person, bicycle, elephant, pizza, etc.) and five human-generated captions (10 to 20 words per caption) for each scene. These captions and object labels served as the linguistic inputs to our language and multimodal encoding models.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003efMRI Data Acquisition and Preprocessing\u003c/h2\u003e \u003cp\u003eFunctional MRI data were acquired at a 7-Tesla field strength with a high spatial resolution of 1.8 mm isotropic voxels. Standard preprocessing steps were performed using the fMRIPrep pipeline v24.0.0 (Esteban et al., \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), including slice timing correction, head motion correction, co-registration of the functional images to the participant\u0026rsquo;s T1w image, and nuisance regression of motion parameters, white matter (WM), and cerebrospinal fluid (CSF) signals. 
The functional images were subsequently normalized to the ICBM152 template by combining the linear functional-to-structural transformation with the nonlinear warping from individual structural space to the MNI space.\u003c/p\u003e \u003cp\u003eTo accurately estimate the brain response to each short-duration (3 s) scene presentation, which is challenging due to low signal-to-noise ratio (SNR) and highly overlapping hemodynamic response function (HRF) effects in rapid event-related designs, we employed the GLMsingle model (Prince et al., \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), a specialized toolbox designed to robustly estimate single-trial beta-values, representing the fMRI response to each scene image. Specifically, we first estimated the optimal voxel-specific HRFs using a library of 20 HRF basis functions, selecting for each voxel the HRF that best fit its BOLD signal. Then, the resulting HRF index map was denoised using the single-trial nuisance regression incorporating head motion parameters (Kay et al., \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2013\u003c/span\u003e). Finally, the fractional ridge regression model was applied to disentangle the contribution of each trial to the measured BOLD signals and to estimate the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\beta\\:\\)\u003c/span\u003e\u003c/span\u003e values of each trial, representing the fMRI response to each scene image. 
The tradeoff between the regularized and unregularized coefficients in the model was controlled by a fixed ratio γ.\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$${\\widehat{\\beta}}^{RR}=\\underset{\\beta}{\\operatorname{argmin}}\\left({\\Vert y-\\left(X\\ast {f}_{hrf}^{\\ast}\\right)\\beta \\Vert}^{2}+{\\Vert \\beta \\Vert}^{2}\\right)$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$${\\widehat{\\beta}}^{OLS}=\\underset{\\beta}{\\operatorname{argmin}}\\left({\\Vert y-\\left(X\\ast {f}_{hrf}^{\\ast}\\right)\\beta \\Vert}^{2}\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:y\\)\u003c/span\u003e\u003c/span\u003e is the measured BOLD time-series; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:X\\)\u003c/span\u003e\u003c/span\u003e is the design matrix of visual stimuli, constructed by merging a series of delta functions with 1 indicating the duration of a specific event and 0 for other time points; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{f}_{hrf}^{\\ast}\\)\u003c/span\u003e\u003c/span\u003e is the selected voxel-specific HRF; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\widehat{{\\beta\\:}}}^{RR}\\)\u003c/span\u003e\u003c/span\u003e is the best fit with regularized coefficients; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\widehat{{\\beta\\:}}}^{OLS}\\)\u003c/span\u003e\u003c/span\u003e is the estimated \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:\\beta\\:\\)\u003c/span\u003e\u003c/span\u003e without regularization. 
Then, the actual brain response \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\widehat{\\beta\\:}}^{\\ast\\:}\\)\u003c/span\u003e\u003c/span\u003e was calculated as:\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\gamma =\\frac{{\\Vert {\\widehat{\\beta}}^{RR}\\Vert}_{2}}{{\\Vert {\\widehat{\\beta}}^{OLS}\\Vert}_{2}}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$${\\widehat{\\beta}}^{\\ast}=\\underset{\\beta}{\\operatorname{argmin}}\\left({\\Vert y-\\left(X\\ast {f}_{hrf}^{\\ast}\\right)\\beta \\Vert}^{2}+\\gamma {\\Vert \\beta \\Vert}^{2}\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe resulting voxel-wise maps of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\widehat{\\beta\\:}}^{\\ast\\:}\\)\u003c/span\u003e\u003c/span\u003e were then projected onto the fsaverage cortical surface template (Fischl et al., \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2004\u003c/span\u003e), resulting in a vertex-wise brain response map (163,842 vertices per hemisphere) for each scene image. These vertex-wise brain maps served as the target variable for all subsequent encoding model analyses.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eVertex-wise Encoding Model\u003c/h2\u003e \u003cp\u003eWe implemented a vertex-based encoding framework to predict single-trial fMRI responses of natural scenes using representations derived from computer vision, language and multimodal deep-learning architectures. This framework links stimulus features to cortical activity through a two-stage process. 
Firstly, all types of stimulus content, like scene images or the corresponding object labels or text captions, were embedded in a feature vector of latent representations extracted from various computational models (i.e., feature embedding). Secondly, a subject-specific, vertex-wise ridge regression model was trained to predict the measured single-trial fMRI responses using the extracted latent features (ridge regression).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eFeature embedding\u003c/h2\u003e \u003cp\u003eFor each single-trial stimulus, we located the scene image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{I}_{i}\\)\u003c/span\u003e\u003c/span\u003e and its corresponding text captions \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{i}\\)\u003c/span\u003e\u003c/span\u003e from the COCO dataset, and extracted various types of latent feature vectors in terms of image patches, object categories, text captions, and joint embeddings of both image and text, which are formulated as follows:\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$\\:{z}_{i}^{V}={f}_{V}\\left({I}_{i}\\right)$$\u003c/div\u003e\u003c/div\u003e;\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$\\:{z}_{i}^{T}={f}_{T}\\left({T}_{i}\\right)$$\u003c/div\u003e\u003c/div\u003e;\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$$\\:{z}_{i}^{\\text{M}\\text{u}\\text{l}\\text{t}\\text{i}\\text{H}\\text{o}\\text{t}}={f}_{\\text{M}\\text{u}\\text{l}\\text{t}\\text{i}\\text{H}\\text{o}\\text{t}}\\left({T}_{i}\\right)$$\u003c/div\u003e\u003c/div\u003e;\u003cdiv id=\"Equ3\" 
class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$$\\:{z}_{i}^{\\text{J}\\text{o}\\text{i}\\text{n}\\text{t}}={f}_{\\text{J}\\text{o}\\text{i}\\text{n}\\text{t}}\\left({I}_{i},\\:{T}_{i}\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{V}\\)\u003c/span\u003e\u003c/span\u003e represents the image feature vector, extracted by implementing a computer vision model \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{f}_{V}\\)\u003c/span\u003e\u003c/span\u003e on the scene image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{I}_{i}\\)\u003c/span\u003e\u003c/span\u003e; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{T}\\)\u003c/span\u003e\u003c/span\u003e represents the text embedding of the \u003cem\u003ei-\u003c/em\u003eth image, extracted by applying a language model \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{f}_{T}\\)\u003c/span\u003e\u003c/span\u003e on the text captions \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{i}\\)\u003c/span\u003e\u003c/span\u003e; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{\\text{M}\\text{u}\\text{l}\\text{t}\\text{i}\\text{H}\\text{o}\\text{t}}\\)\u003c/span\u003e\u003c/span\u003e represents the multi-hot embedding of the \u003cem\u003ei-\u003c/em\u003eth image, indicating whether a specific object out of 80 \u0026ldquo;thing\u0026rdquo; categories has appeared in the image; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{\\text{J}\\text{o}\\text{i}\\text{n}\\text{t}}\\)\u003c/span\u003e\u003c/span\u003e 
represents the multimodal joint embedding of image and text for the \u003cem\u003ei-\u003c/em\u003eth image, generated by a multimodal model \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{f}_{Joint}\\)\u003c/span\u003e\u003c/span\u003e that jointly processes the image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{I}_{i}\\)\u003c/span\u003e\u003c/span\u003e and text captions \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{i}\\)\u003c/span\u003e\u003c/span\u003e. These four types of latent features, i.e., image features (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{V}\\)\u003c/span\u003e\u003c/span\u003e), text features (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{T}\\)\u003c/span\u003e\u003c/span\u003e), multi-hot labels (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{\\text{M}\\text{u}\\text{l}\\text{t}\\text{i}\\text{H}\\text{o}\\text{t}}\\)\u003c/span\u003e\u003c/span\u003e), and image-text joint embeddings (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{\\text{J}\\text{o}\\text{i}\\text{n}\\text{t}}\\)\u003c/span\u003e\u003c/span\u003e), constituted the input regressors for the subsequent ridge regression models.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eRidge Regression\u003c/h2\u003e \u003cp\u003eWe implemented kernel ridge regression to learn a linear mapping \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:W\\)\u003c/span\u003e\u003c/span\u003e between the feature embeddings \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:Z=\\left\\{{z}_{\\text{i}}\\right\\}\\)\u003c/span\u003e\u003c/span\u003e and the observed vertex-wise brain 
responses β. The model was trained by minimizing the following objective function:\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$$\\:{\\mathcal{ℒ}}_{\\text{W}}\\left(\\text{Z}\\right)=\\:{‖{\\beta\\:}-\\:\\text{Z}\\text{W}‖}^{2}+\\:{\\lambda\\:}{‖\\text{W}‖}^{2}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:Z\\in\\:{\\mathbb{ℝ}}^{N\\times\\:D}\\)\u003c/span\u003e\u003c/span\u003e represents the feature embedding matrix for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:N\\)\u003c/span\u003e\u003c/span\u003e training stimuli with \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{D}\\)\u003c/span\u003e\u003c/span\u003e dimensional features; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\beta\\:\\in\\:{\\mathbb{ℝ}}^{N\\times\\:S}\\)\u003c/span\u003e\u003c/span\u003e represents the corresponding brain responses for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:S\\)\u003c/span\u003e\u003c/span\u003e cortical vertices across \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:N\\)\u003c/span\u003e\u003c/span\u003e training stimuli; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:W\\in\\:{\\mathbb{ℝ}}^{D\\times\\:S}\\)\u003c/span\u003e\u003c/span\u003e is the weight matrix of the ridge regression, that projects feature embeddings onto brain responses; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\lambda\\:\\:\\)\u003c/span\u003e\u003c/span\u003eis the regularization hyperparameter; \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:\\beta\\:\\)\u003c/span\u003e\u003c/span\u003e is the surface-projected version of the estimated brain response \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\widehat{\\beta\\:}}^{\\ast\\:}\\:\\)\u003c/span\u003e\u003c/span\u003e in Eq.\u0026nbsp;\u003cspan refid=\"Equ2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. Subject-specific encoding models were trained separately for each participant and each vertex, using their unique 9,000 training images. The optimal hyperparameter \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\lambda\\:\\:\\)\u003c/span\u003e\u003c/span\u003e was selected via 10-fold cross-validation on the training set. We quantified the performance of each encoding model by computing Pearson\u0026rsquo;s correlation coefficient (\u003cem\u003er\u003c/em\u003e) between the predicted (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:ZW\\)\u003c/span\u003e\u003c/span\u003e) and observed (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\beta\\:\\)\u003c/span\u003e\u003c/span\u003e) fMRI responses across the 1,000 test scene images. 
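The ridge step in Eq. 4 can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, synthetic data, and the plain closed-form (primal) ridge solution rather than the kernel formulation named in the text. The function names `fit_ridge` and `select_lambda`, the candidate grid of regularization values, and the array sizes are hypothetical.

```python
import numpy as np

def fit_ridge(Z, B, lam):
    """Closed-form minimizer of ||B - Z W||^2 + lam * ||W||^2 (Eq. 4)."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ B)

def select_lambda(Z, B, lambdas, n_folds=10, seed=0):
    """Return the candidate lambda with the lowest mean held-out squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(Z.shape[0]), n_folds)
    errors = []
    for lam in lambdas:
        fold_err = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            W = fit_ridge(Z[trn], B[trn], lam)
            fold_err.append(np.mean((B[val] - Z[val] @ W) ** 2))
        errors.append(np.mean(fold_err))
    return lambdas[int(np.argmin(errors))]

# Synthetic stand-ins (sizes are arbitrary): N stimuli, D features, S vertices.
rng = np.random.default_rng(0)
N, D, S = 200, 20, 5
Z = rng.standard_normal((N, D))
W_true = rng.standard_normal((D, S))
B = Z @ W_true + 0.1 * rng.standard_normal((N, S))

best_lam = select_lambda(Z, B, lambdas=[0.01, 1.0, 100.0])
W_hat = fit_ridge(Z, B, best_lam)
```

Solving the regularized normal equations once for all vertices works because the same feature matrix is shared across vertices; a kernel (dual) solver gives the same solution and is typically preferred when the feature dimension far exceeds the number of stimuli.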
Additionally, for multi-hot word embedding \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{\\text{M}\\text{u}\\text{l}\\text{t}\\text{i}\\text{H}\\text{o}\\text{t}}\\)\u003c/span\u003e\u003c/span\u003e, we decomposed the weight matrix \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{W}_{label}\\)\u003c/span\u003e\u003c/span\u003e into different principal components that define the latent semantic gradients of the 80 object-category representations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eComputational Models for Feature Embedding\u003c/h2\u003e \u003cp\u003eOur encoding framework employed three distinct families of deep learning architectures, including Vision, Language, and Multimodal fusion models, to generate feature embeddings for predicting fMRI activity. For vision-based models, we utilized two prominent architectures, ResNet and Vision Transformer (ViT), to extract latent features of natural scenes across different scales and levels of abstraction. The language-based models, Multi-hot Encoding (MultiHot) and Bidirectional Encoder Representations from Transformers (BERT), captured semantic and linguistic representations of the scene images. Finally, a dedicated multimodal fusion approach was implemented via the Vision-and-Language Transformer (ViLT) architecture.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eVisual feature embeddings\u003c/h2\u003e \u003cp\u003e \u003cb\u003eResNet\u003c/b\u003e (He et al., \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) is a canonical Convolutional Neural Network (CNN) that uses skip-residual connections to mitigate gradient vanishing/explosion in deep neural networks. 
We used an ImageNet-pretrained ResNet-50 architecture and extracted layer activations from its four major residual blocks, yielding a multi-scale hierarchy of visual feature embeddings. Considering the distinct sizes of activation tensors across residual blocks (Block-1: 56 \u0026times; 56 \u0026times; 256; Block-2: 28 \u0026times; 28 \u0026times; 512; Block-3: 14 \u0026times; 14 \u0026times; 1024; Block-4: 7 \u0026times; 7 \u0026times; 2048), we flattened the activation tensor from each residual block and randomly sampled a fixed 100k-dimensional vector \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{V}\\)\u003c/span\u003e\u003c/span\u003e to obtain a uniform representation for each scene image.\u003c/p\u003e \u003cp\u003e \u003cb\u003eVision Transformer\u003c/b\u003e (ViT) (Dosovitskiy et al., \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) treats an input image as a sequence of fixed-size image patches, linearly embeds each patch, and processes the resulting latent features with a standard Transformer encoder. We used a pretrained ViT-B/16 model and extracted the final-layer patch tokens and the final CLS token, yielding a 768-dimensional feature embedding vector \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{z}}_{\\text{i}}^{\\text{V}}\\)\u003c/span\u003e\u003c/span\u003e for each scene image.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eSemantic embedding and language models\u003c/h2\u003e \u003cp\u003e \u003cb\u003eMulti-hot Encoding\u003c/b\u003e represents each scene as an 80-dimensional binary vector \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{\\text{M}\\text{u}\\text{l}\\text{t}\\text{i}\\text{H}\\text{o}\\text{t}}\\)\u003c/span\u003e\u003c/span\u003e, whose entries indicate the presence (1) or absence (0) of the corresponding COCO \u0026ldquo;thing\u0026rdquo; category. 
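As a minimal illustration of the multi-hot encoding just described, the sketch below builds a binary presence vector in plain Python. The five-category vocabulary and the helper name `multi_hot` are hypothetical stand-ins for the 80 COCO categories, chosen only to keep the example short.

```python
def multi_hot(present, categories):
    """Return a binary list with 1 where a category is present, else 0."""
    index = {name: i for i, name in enumerate(categories)}
    vec = [0] * len(categories)
    for name in present:
        vec[index[name]] = 1
    return vec

# Hypothetical 5-category vocabulary standing in for the 80 COCO categories.
categories = ["person", "dog", "car", "chair", "pizza"]
z = multi_hot({"dog", "person"}, categories)
```

Note that the vector records only whether a category occurs, not how many instances appear or where they are located in the image.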
By mapping the continuous image-pixel space onto a discrete object-label space, this embedding yields a low-dimensional abstract semantic description of scene images. This approach has previously been used to model the continuous semantic mapping in the human cortex (Huth et al., \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2012\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cb\u003eBERT\u003c/b\u003e (Devlin et al., \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) is a bidirectional Transformer encoder pretrained on BooksCorpus and English Wikipedia, yielding context-sensitive embeddings for every language token. For each scene image, we retrieved the five associated captions provided by COCO, pooled them into a single text block, tokenized the block, and extracted the CLS token vector from the pretrained BERT model, resulting in a 768-dimensional linguistic signature of each scene image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\:{\\text{z}}_{\\text{i}}^{\\text{T}}\\)\u003c/span\u003e\u003c/span\u003e. To determine how caption length shapes this semantic representation, we repeated the procedure using the shortest, medium-length, or longest single caption, as well as the full five-sentence caption set.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003eMultimodal joint embedding of image and text\u003c/h2\u003e \u003cp\u003eThe multimodal feature space was modeled using the \u003cb\u003eVision-and-Language Transformer\u003c/b\u003e (ViLT) (Kim et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2021\u003c/span\u003e), which uses a single shared Transformer stack to jointly process image patches and caption tokens. ViLT's architecture integrates three components: image-patch embeddings (from a ViT encoder), word embeddings (from a BERT tokenizer), and cross-modal self-attention layers. 
The model was originally trained on large-scale datasets, including COCO, using three primary objectives: Image-Text Matching (ITM), Masked Language Modeling (MLM), and Word-Patch Alignment (WPA). ITM aligns the joint embedding space by distinguishing matched image-text pairs from mismatched (negative) pairs: image-caption pairs are sampled and the negative log-likelihood loss is computed as:\u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e\n$$\\:{\\mathcal{ℒ}}_{\\text{I}\\text{T}\\text{M}}\\left({\\theta\\:}\\right)=\\:-{\\mathbb{E}}_{\\left({\\text{I}}_{\\text{i}},{\\:\\text{T}}_{\\text{j}}\\right)\\sim\\text{D}}\\left[\\text{y}\\:\\text{log}\\:{\\text{s}}_{{\\theta\\:}}\\left({\\text{I}}_{\\text{i}},{\\text{T}}_{\\text{j}}\\right)+\\left(1-\\text{y}\\right)\\text{log}\\left(1-{\\text{s}}_{{\\theta\\:}}\\left({\\text{I}}_{\\text{i}},{\\text{T}}_{\\text{j}}\\right)\\right)\\right]$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{s}}_{{\\theta\\:}}\\left({\\text{I}}_{\\text{i}},{\\text{T}}_{\\text{j}}\\right)\\)\u003c/span\u003e\u003c/span\u003e indicates the alignment of image and text embeddings, measured by cosine similarity, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{y}\\in\\:\\{0,\\:1\\}\\)\u003c/span\u003e\u003c/span\u003e indicates whether the sampled image-caption pair \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\left({\\text{I}}_{\\text{i}},{\\text{T}}_{\\text{j}}\\right)\\:\\)\u003c/span\u003e\u003c/span\u003e is matched (1) or not (0). 
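To make the ITM objective in Eq. 5 concrete, the following sketch evaluates it as a binary cross-entropy over a handful of image-caption pairs. It assumes the alignment score has already been mapped to a probability in the open interval (0, 1), which a raw cosine similarity does not guarantee; the function name `itm_loss` and the toy scores are hypothetical.

```python
import math

def itm_loss(scores, labels):
    """Mean binary cross-entropy over (image, caption) pairs, as in Eq. 5.

    scores: matching probabilities s in (0, 1) (assumed already normalized)
    labels: 1 for a matched pair, 0 for a mismatched pair
    """
    total = 0.0
    for s, y in zip(scores, labels):
        total += y * math.log(s) + (1 - y) * math.log(1 - s)
    return -total / len(scores)

# A confident, correct scorer versus a confident, wrong one on the same two pairs.
loss_good = itm_loss([0.9, 0.1], [1, 0])
loss_bad = itm_loss([0.1, 0.9], [1, 0])
```

The loss is small when matched pairs receive high scores and mismatched pairs low scores, and grows quickly when the scorer is confidently wrong.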
Masked language modeling (MLM) trains the model to predict the masked words \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{T}}_{\\text{i},\\text{m}}\\)\u003c/span\u003e\u003c/span\u003e based on the surrounding context \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{T}}_{\\text{i},\\backslash\\:\\text{m}}\\)\u003c/span\u003e\u003c/span\u003e and all image patches \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{I}}_{\\text{i},:}\\)\u003c/span\u003e\u003c/span\u003e, minimizing the negative log-likelihood:\u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e\n$$\\:{\\mathcal{ℒ}}_{\\text{M}\\text{L}\\text{M}}\\left({\\theta\\:}\\right)=\\:-{\\mathbb{E}}_{\\left({\\text{I}}_{\\text{i}},{\\:\\text{T}}_{\\text{i}}\\right)\\sim\\text{D}}\\text{log}{\\text{P}}_{{\\theta\\:}}\\left({\\text{T}}_{\\text{i},\\text{m}}\\mid{\\text{T}}_{\\text{i},\\backslash\\:\\text{m}},\\:{\\text{I}}_{\\text{i},:}\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e6\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\theta\\:}\\)\u003c/span\u003e\u003c/span\u003e denotes the trainable parameters.\u003c/p\u003e \u003cp\u003eHere, we leveraged the pretrained ViLT-B/16 model to obtain the joint image-text embeddings of natural scenes. Each scene image was paired with its five captions and a multi-hot vector of object labels present in the scene. 
The concatenated inputs were fed into the ViLT model, yielding a 768-dimensional multimodal feature vector from the final CLS token as the joint embedding vector \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}_{i}^{\\text{J}\\text{o}\\text{i}\\text{n}\\text{t}}\\)\u003c/span\u003e\u003c/span\u003e. To probe how semantic content modulates this joint representation, we repeated the procedure while varying the textual inputs, such as the shortest, medium-length, or longest single caption, or the multi-hot object-label vector alone.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003eRegion-of-interest definition\u003c/h2\u003e \u003cp\u003eWe localized cortical visual areas with the Kastner atlas (Wang et al., \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2015b\u003c/span\u003e), \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003ea\u003c/span\u003e probabilistic parcellation map derived from high-resolution 7T fMRI retinotopic, visuotopic and attention-mapping data. The atlas delineates 25 topographically organized areas, spanning primary visual regions (V1-V3), extrastriate cortex (V3A/B, V4), ventral occipital areas (VO1/2), parahippocampal areas (PHC1/2), lateral occipital areas (LO1/2), temporal occipital areas (TO1/2, encompassing hMT+), intraparietal sulcus areas (IPS0-5), frontal eye field (FEF) and supplementary eye fields (SEF), registered to the fsaverage surface and thresholded at 25%. 
This yields a set of surface-based probability masks that preserve fine-scale topological boundaries while accounting for inter-subject variability in brain anatomy, enabling region-of-interest analyses precisely aligned to functional visuotopic topography rather than anatomical gyral landmarks.\u003c/p\u003e \u003cdiv id=\"Sec25\" class=\"Section3\"\u003e \u003ch2\u003eEvaluation of encoding models\u003c/h2\u003e \u003cp\u003eFor each cortical vertex, we quantified the prediction accuracy of encoding models using the Pearson correlation coefficient (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:r\\)\u003c/span\u003e\u003c/span\u003e) between predicted and observed brain responses across the held-out test set of 1,000 shared scene images.\u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e\n$$\\:{r}_{i}=\\frac{cov(\\widehat{{\\beta\\:}_{:,i\\:}},\\:{\\beta\\:}_{:,i})}{{\\sigma\\:}_{\\widehat{{\\beta\\:}_{:,i}}}{\\sigma\\:}_{{\\beta\\:}_{:,i}}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e7\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\widehat{{\\beta\\:}_{:,i\\:}}\\)\u003c/span\u003e\u003c/span\u003e denotes the brain response predicted from visual or text embeddings, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\beta\\:}_{:,i}\\)\u003c/span\u003e\u003c/span\u003e denotes the corresponding measured fMRI response provided by the NSD dataset. 
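Eq. 7 can be evaluated for all vertices at once, as sketched below. This assumes NumPy and arrays of shape (images, vertices); the function name `vertexwise_pearson` and the synthetic responses are hypothetical and only illustrate the computation.

```python
import numpy as np

def vertexwise_pearson(pred, obs):
    """Pearson r per column (vertex) for arrays of shape (n_images, n_vertices)."""
    pred_c = pred - pred.mean(axis=0)
    obs_c = obs - obs.mean(axis=0)
    num = (pred_c * obs_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (obs_c ** 2).sum(axis=0))
    return num / den

# Synthetic responses for 1,000 test images and 3 vertices.
rng = np.random.default_rng(0)
obs = rng.standard_normal((1000, 3))
pred = obs.copy()                        # vertex 0: perfect prediction
pred[:, 1] = -obs[:, 1]                  # vertex 1: perfectly anti-correlated
pred[:, 2] = rng.standard_normal(1000)   # vertex 2: unrelated prediction
r = vertexwise_pearson(pred, obs)
```

Centering each column and dividing by the product of norms is algebraically the same as the covariance-over-standard-deviations form in Eq. 7, and it avoids a Python loop over vertices.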
This vertex-wise correlation coefficient is a standard metric in previous neuro-AI studies for assessing the correspondence between artificial neural networks and biological brain activity (G\u0026uuml;\u0026ccedil;l\u0026uuml; and van Gerven, 2015; Horikawa and Kamitani, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Schrimpf et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2020\u003c/span\u003e), with a value approaching 1 indicating a near-perfect prediction of neural activity. To correct for multiple comparisons across the large number of cortical vertices (163,842 vertices per hemisphere in the fsaverage surface), we applied a False Discovery Rate (FDR) correction to the prediction accuracy maps. Only vertices with an FDR-corrected p\u0026thinsp;\u0026lt;\u0026thinsp;0.01 were retained as significant predictions in the encoding models. Overall encoding performance and details of the feature embeddings used are summarized in Table S1.\u003c/p\u003e \u003cp\u003eFurthermore, we evaluated the hemispheric lateralization effect for each encoding model by calculating a laterality index \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{L}\\text{I}=({r}_{L}-{r}_{R})/({r}_{L}+{r}_{R})\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{r}_{L}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{r}_{R}\\)\u003c/span\u003e\u003c/span\u003e are the mean correlations of the top 1% of vertices in the left and right hemispheres, respectively. 
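The laterality index defined above can be sketched as follows, assuming NumPy and one prediction-accuracy array per hemisphere. The helper names `top_fraction_mean` and `laterality_index`, and the toy accuracy maps, are hypothetical; how ties or non-integer vertex counts were handled in the original analysis is not stated, so rounding is an assumption here.

```python
import numpy as np

def top_fraction_mean(r, fraction=0.01):
    """Mean of the highest `fraction` of values in r (at least one value)."""
    k = max(1, int(round(fraction * r.size)))
    return np.sort(r)[-k:].mean()

def laterality_index(r_left, r_right, fraction=0.01):
    """LI = (r_L - r_R) / (r_L + r_R), using top-fraction means per hemisphere."""
    r_l = top_fraction_mean(r_left, fraction)
    r_r = top_fraction_mean(r_right, fraction)
    return (r_l - r_r) / (r_l + r_r)

# Toy accuracy maps with 200 vertices per hemisphere, so the top 1% is 2 vertices.
r_left = np.zeros(200)
r_left[:2] = 0.6
r_right = np.zeros(200)
r_right[:2] = 0.2
li = laterality_index(r_left, r_right)
```

With these toy values the left hemisphere's best vertices are predicted three times as well as the right's, so the index is positive; swapping the two arrays flips its sign.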
The LI ranges from \u0026minus;\u0026thinsp;1 (right dominant) to +\u0026thinsp;1 (left dominant).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section3\"\u003e \u003ch2\u003eInter-subject variability analysis\u003c/h2\u003e \u003cp\u003eTo determine whether visual or linguistic factors drive individual differences in the multimodal fusion model, we trained subject-specific encoding models and estimated the inter-subject variability across the four participants. Specifically, the vertex-wise prediction accuracy maps were first Fisher-z-transformed, yielding one z-map per model per subject. We computed the across-subject variance of these z-maps and subsequently correlated the resulting variance maps between the visual, linguistic, and multimodal encoding models. This enables us to quantify the extent to which shared variability in the multimodal model is driven by its constituent visual or linguistic components.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section3\"\u003e \u003ch2\u003eSemantic gradient analysis\u003c/h2\u003e \u003cp\u003eThe multi-hot object-label encoding model effectively captures the high-level semantic structure of objects and concepts in natural scenes (Huth et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2016\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2012\u003c/span\u003e). We therefore used it to derive a unified group-level cortical semantic atlas across subjects. First, we concatenated the fMRI data of all participants after projecting their individual brain responses onto the fsaverage5 surface template. A group-level category-by-vertex weight matrix was then trained using kernel ridge regression with 10-fold cross-validation. Next, we applied Singular Value Decomposition (SVD) to the group-level weight matrix and extracted the first three principal components (PCs), which jointly explained 60% of the total variance. 
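The decomposition step can be illustrated with a small sketch, assuming NumPy and a synthetic category-by-vertex weight matrix. Centering each category's weights across vertices before the SVD is an assumption of this sketch, not something the text specifies, and the matrix sizes and variable names are hypothetical.

```python
import numpy as np

# Hypothetical group-level weight matrix: 80 categories x 500 vertices,
# built from three latent components plus a small amount of noise.
rng = np.random.default_rng(0)
n_categories, n_vertices = 80, 500
W_group = rng.standard_normal((n_categories, 3)) @ rng.standard_normal((3, n_vertices))
W_group += 0.1 * rng.standard_normal((n_categories, n_vertices))

# Center each category across vertices (assumption), then decompose.
Wc = W_group - W_group.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Wc, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)   # variance fraction per component
pcs = U[:, :3]                        # category loadings, shape (80, 3)
scores = pcs.T @ Wc                   # per-vertex component scores, shape (3, 500)
```

Each column of `pcs` assigns a loading to every object category, and each row of `scores` gives one value per vertex, which is the form needed to color a cortical surface by three components.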
These PCs define orthogonal \u0026ldquo;semantic gradients\u0026rdquo; spanning the cortical surface and establish a unified color palette for representing object semantics. Finally, each participant\u0026rsquo;s individual weight matrix was projected onto these axes by multiplying it with the PC loadings, yielding a continuous semantic map in the cortex for each subject. This approach enables the clear and intuitive visualization of complex semantic relationships of objects and concepts, and facilitates the comparison of the cortical semantic atlas across individual participants.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAcknowledgment\u003c/h2\u003e\n\u003cp\u003eThis work was partially supported by the STI2030-Major Projects 2021ZD0200200, 2022ZD0211500, the National Natural Science Foundation of China (Grant Nos. 62201519, 52307259, 62327805, 82151307,82202253).\u003c/p\u003e\n\u003ch2\u003eAuthor contributions\u003c/h2\u003e\n\u003cp\u003eConceptualization: YZ; Methodology: YZ; Visualization: ZHY,YXT, YZ;\u003c/p\u003e\n\u003cp\u003eData analysis: JZ, ZHY,YXT, YZ;\u003c/p\u003e\n\u003cp\u003eInvestigation: ZHY, YXT, YZ, JZ, WYY, TQ, SYL;\u003c/p\u003e\n\u003cp\u003eWriting—original draft: TZJ, YFH, JGD, SYL, YZ;\u003c/p\u003e\n\u003cp\u003eWriting—review \u0026amp; editing: TZJ, YFH, JGD, SYL, YZ;\u003c/p\u003e\n\u003ch2\u003eCompeting interests\u003c/h2\u003e\n\u003cp\u003eThe authors declare no competing financial interests.\u003c/p\u003e\n\u003ch2\u003eData and Code availability statement\u003c/h2\u003e\n\u003cp\u003eThe Natural Scenes Dataset (NSD) and COCO datasets are public and accessible to all researchers. The 7T fMRI dataset from NSD can be accessed via https://naturalscenesdataset.org/. The natural scene images and the corresponding object categories and text captions from COCO can be downloaded from https://cocodataset.org/dataset/home.htm. 
All unimodal and multimodal encoding models and analysis code will be made available upon request to ensure reproducibility.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAllen, E.J., St-Yves, G., Wu, Y., Breedlove, J.L., Prince, J.S., Dowdle, L.T., Nau, M., Caron, B., Pestilli, F., Charest, I., Hutchinson, J.B., Naselaris, T., Kay, K.: A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 1\u0026ndash;11 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41593-021-00962-x\u003c/span\u003e\u003cspan address=\"10.1038/s41593-021-00962-x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAntonello, R., Vaidya, A., Huth, A.G.: Scaling laws for language encoding models in fMRI. (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2305.11863\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2305.11863\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBi, Y.: Dual coding of knowledge in the human brain. Trends Cogn. Sci. \u003cb\u003e25\u003c/b\u003e, 883\u0026ndash;895 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.tics.2021.07.006\u003c/span\u003e\u003cspan address=\"10.1016/j.tics.2021.07.006\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBonnen, T., Yamins, D.L.K., Wagner, A.D.: When the ventral visual stream is not enough: A deep learning account of medial temporal lobe involvement in perception. Neuron. \u003cb\u003e109\u003c/b\u003e, 2755\u0026ndash;2766e6 (2021). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neuron.2021.06.018\u003c/span\u003e\u003cspan address=\"10.1016/j.neuron.2021.06.018\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBonner, M.F., Epstein, R.A.: Object representations in the human brain reflect the co-occurrence statistics of vision and language. Nat. Commun. \u003cb\u003e12\u003c/b\u003e, 4081 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41467-021-24368-2\u003c/span\u003e\u003cspan address=\"10.1038/s41467-021-24368-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDevlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1810.04805\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1810.04805\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiCarlo, J.J., Zoccolan, D., Rust, N.C.: How Does the Brain Solve Visual Object Recognition? Neuron. \u003cb\u003e73\u003c/b\u003e, 415\u0026ndash;434 (2012). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neuron.2012.01.010\u003c/span\u003e\u003cspan address=\"10.1016/j.neuron.2012.01.010\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoerig, A., Kietzmann, T.C., Allen, E., Wu, Y., Naselaris, T., Kay, K., Charest, I.: High-level visual representations in the human brain are aligned with large language models. Nat. Mach. Intell. 
\u003cb\u003e7\u003c/b\u003e, 1220\u0026ndash;1234 (2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s42256-025-01072-0\u003c/span\u003e\u003cspan address=\"10.1038/s42256-025-01072-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S.: An image is worth 16x16 words: Transformers for image recognition at scale. (2020). arXiv preprint arXiv:2010.11929.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEsteban, O., Markiewicz, C.J., Blair, R.W., Moodie, C.A., Isik, A.I., Erramuzpe, A., Kent, J.D., Goncalves, M., DuPre, E., Snyder, M., Oya, H., Ghosh, S.S., Wright, J., Durnez, J., Poldrack, R.A., Gorgolewski, K.J.: fMRIPrep: a robust preprocessing pipeline for functional MRI. Nat. Methods. \u003cb\u003e16\u003c/b\u003e, 111\u0026ndash;116 (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41592-018-0235-4\u003c/span\u003e\u003cspan address=\"10.1038/s41592-018-0235-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFedorenko, E., Behr, M.K., Kanwisher, N.: Functional specificity for high-level linguistic processing in the human brain. Proc. Natl. Acad. Sci. \u003cb\u003e108\u003c/b\u003e, 16428\u0026ndash;16433 (2011). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1073/pnas.1112937108\u003c/span\u003e\u003cspan address=\"10.1073/pnas.1112937108\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFischl, B., van der Kouwe, A., Destrieux, C., Halgren, E., S\u0026eacute;gonne, F., Salat, D.H., Busa, E., Seidman, L.J., Goldstein, J., Kennedy, D., Caviness, V., Makris, N., Rosen, B., Dale, A.M.: Automatically parcellating the human cerebral cortex. Cereb. Cortex. \u003cb\u003e14\u003c/b\u003e, 11\u0026ndash;22 (2004). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/cercor/bhg087\u003c/span\u003e\u003cspan address=\"10.1093/cercor/bhg087\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGrand, G., Blank, I.A., Pereira, F., Fedorenko, E.: Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nat. Hum. Behav. 1\u0026ndash;13 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41562-022-01316-8\u003c/span\u003e\u003cspan address=\"10.1038/s41562-022-01316-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eG\u0026uuml;\u0026ccedil;l\u0026uuml;, U., Gerven, M.A.J., van: Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. J. Neurosci. \u003cb\u003e35\u003c/b\u003e, 10005\u0026ndash;10014 (2015). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1523/JNEUROSCI.5023-14.2015\u003c/span\u003e\u003cspan address=\"10.1523/JNEUROSCI.5023-14.2015\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770\u0026ndash;778 (2016)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHorikawa, T., Kamitani, Y.: Generic decoding of seen and imagined objects using hierarchical visual features. Nat. Commun. \u003cb\u003e8\u003c/b\u003e, 15037 (2017). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/ncomms15037\u003c/span\u003e\u003cspan address=\"10.1038/ncomms15037\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuth, A.G., de Heer, W.A., Griffiths, T.L., Theunissen, F.E., Gallant, J.L.: Natural speech reveals the semantic maps that tile human cerebral cortex. Nature. \u003cb\u003e532\u003c/b\u003e, 453\u0026ndash;458 (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/nature17637\u003c/span\u003e\u003cspan address=\"10.1038/nature17637\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuth, A.G., Nishimoto, S., Vu, A.T., Gallant, J.L.: A Continuous Semantic Space Describes the Representation of Thousands of Object and Action Categories across the Human Brain. Neuron. \u003cb\u003e76\u003c/b\u003e, 1210\u0026ndash;1224 (2012). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neuron.2012.10.014\u003c/span\u003e\u003cspan address=\"10.1016/j.neuron.2012.10.014\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKay, K., Rokem, A., Winawer, J., Dougherty, R., Wandell, B.: GLMdenoise: a fast, automated technique for denoising task-based fMRI data. Front. Neurosci. \u003cb\u003e7\u003c/b\u003e, 247 (2013)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim, W., Son, B., Kim, I.: ViLT: Vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning (ICML). PMLR, pp. 5583\u0026ndash;5594 (2021)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeBel, A., Jain, S., Huth, A.G.: Voxelwise Encoding Models Show That Cerebellar Language Representations Are Highly Conceptual. J. Neurosci. \u003cb\u003e41\u003c/b\u003e, 10341\u0026ndash;10355 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1523/JNEUROSCI.0118-21.2021\u003c/span\u003e\u003cspan address=\"10.1523/JNEUROSCI.0118-21.2021\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., Chang, K.-W.: VisualBERT: A Simple and Performant Baseline for Vision and Language. (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1908.03557\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1908.03557\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u0026aacute;r, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. 
In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision \u0026ndash; ECCV 2014, pp. 740\u0026ndash;755. Springer International Publishing, Cham (2014). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/978-3-319-10602-1_48\u003c/span\u003e\u003cspan address=\"10.1007/978-3-319-10602-1_48\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLupyan, G., Abdel Rahman, R., Boroditsky, L., Clark, A.: Effects of Language on Visual Perception. Trends Cogn. Sci. \u003cb\u003e24\u003c/b\u003e, 930\u0026ndash;944 (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.tics.2020.08.005\u003c/span\u003e\u003cspan address=\"10.1016/j.tics.2020.08.005\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMalik-Moraleda, S., Ayyash, D., Gall\u0026eacute;e, J., Affourtit, J., Hoffmann, M., Mineroff, Z., Jouravlev, O., Fedorenko, E.: An investigation across 45 languages and 12 language families reveals a universal language network. Nat. Neurosci. 1\u0026ndash;6 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41593-022-01114-5\u003c/span\u003e\u003cspan address=\"10.1038/s41593-022-01114-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMillidge, B., Seth, A., Buckley, C.L.: Predictive Coding: a Theoretical and Experimental Review. (2022). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2107.12979\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2107.12979\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMitchell, T.M., Shinkareva, S.V., Carlson, A., Chang, K.-M., Malave, V.L., Mason, R.A., Just, M.A.: Predicting Human Brain Activity Associated with the Meanings of Nouns. Science. \u003cb\u003e320\u003c/b\u003e, 1191\u0026ndash;1195 (2008). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1126/science.1152876\u003c/span\u003e\u003cspan address=\"10.1126/science.1152876\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNishida, S., Nishimoto, S.: Decoding naturalistic experiences from human brain activity via distributed representations of words. NeuroImage. \u003cb\u003e180\u003c/b\u003e, 232\u0026ndash;242 (2018). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neuroimage.2017.08.017\u003c/span\u003e\u003cspan address=\"10.1016/j.neuroimage.2017.08.017\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNishimoto, S., Vu, A.T., Naselaris, T., Benjamini, Y., Yu, B., Gallant, J.L.: Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. \u003cb\u003e21\u003c/b\u003e, 1641\u0026ndash;1646 (2011). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cub.2011.08.031\u003c/span\u003e\u003cspan address=\"10.1016/j.cub.2011.08.031\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePopham, S.F., Huth, A.G., Bilenko, N.Y., Deniz, F., Gao, J.S., Nunez-Elizalde, A.O., Gallant, J.L.: Visual and linguistic semantic representations are aligned at the border of human visual cortex. Nat. Neurosci. \u003cb\u003e24\u003c/b\u003e, 1628\u0026ndash;1636 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41593-021-00921-6\u003c/span\u003e\u003cspan address=\"10.1038/s41593-021-00921-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePrince, J.S., Charest, I., Kurzawski, J.W., Pyles, J.A., Tarr, M.J., Kay, K.N.: Improving the accuracy of single-trial fMRI response estimates using GLMsingle. eLife. \u003cb\u003e11\u003c/b\u003e, e77599 (2022)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRadford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2103.00020\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2103.00020\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSalvatori, T., Song, Y., Lukasiewicz, T., Bogacz, R., Xu, Z.: Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks. (2021). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2103.03725\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2103.03725\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchrimpf, M., Kubilius, J., Hong, H., Majaj, N.J., Rajalingham, R., Issa, E.B., Kar, K., Bashivan, P., Prescott-Roy, J., Geiger, F., Schmidt, K., Yamins, D.L.K., DiCarlo, J.J.: Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? bioRxiv 407007. (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/407007\u003c/span\u003e\u003cspan address=\"10.1101/407007\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShen, G., Horikawa, T., Majima, K., Kamitani, Y.: Deep image reconstruction from human brain activity. PLoS Comput. Biol. \u003cb\u003e15\u003c/b\u003e, e1006633 (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pcbi.1006633\u003c/span\u003e\u003cspan address=\"10.1371/journal.pcbi.1006633\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, A.Y., Kay, K., Naselaris, T., Tarr, M.J., Wehbe, L.: Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat. Mach. Intell. \u003cb\u003e5\u003c/b\u003e, 1415\u0026ndash;1426 (2023). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s42256-023-00753-y\u003c/span\u003e\u003cspan address=\"10.1038/s42256-023-00753-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, L., Mruczek, R.E., Arcaro, M.J., Kastner, S.: Probabilistic maps of visual topography in human cortex. Cereb. Cortex. \u003cb\u003e25\u003c/b\u003e, 3911\u0026ndash;3931 (2015a)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, L., Mruczek, R.E.B., Arcaro, M.J., Kastner, S.: Probabilistic Maps of Visual Topography in Human Cortex. Cereb. Cortex. \u003cb\u003e25\u003c/b\u003e, 3911\u0026ndash;3931 (2015b). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/cercor/bhu277\u003c/span\u003e\u003cspan address=\"10.1093/cercor/bhu277\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYamins, D.L.K., DiCarlo, J.J.: Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. \u003cb\u003e19\u003c/b\u003e, 356\u0026ndash;365 (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/nn.4244\u003c/span\u003e\u003cspan address=\"10.1038/nn.4244\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYamins, D.L.K., Hong, H., Cadieu, C.F., Solomon, E.A., Seibert, D., DiCarlo, J.J.: Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS. \u003cb\u003e111\u003c/b\u003e, 8619\u0026ndash;8624 (2014). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1073/pnas.1403112111\u003c/span\u003e\u003cspan address=\"10.1073/pnas.1403112111\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSupplementary: figures\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Encoding model, Natural scene understanding, Multimodal Integration, Vision-language models, Top-Down Modulation, fMRI","lastPublishedDoi":"10.21203/rs.3.rs-8259624/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8259624/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eNatural scene understanding requires the seamless integration of high-resolution sensory inputs with abstract conceptual knowledge. Conventional computational models often treat scene comprehension as a feed-forward, visual-centric process. Here, we challenge this view by proposing the Semantic Scaffold framework, positing that language-derived semantic knowledge acts as a foundational component that actively shapes visual perception. To test this, we leveraged unimodal (visual-only, language-only) and multimodal (visual-language) encoding models as computational probes on the massive 7T fMRI Natural Scenes Dataset (NSD) to systematically dissect the functional topography of the human cortex. We reveal a fundamental cortical dissociation: perceptually-driven visual features are confined to the visual cortex, whereas language-derived features robustly predict activity across expansive frontal and temporal association cortices. Crucially, multimodal integration is necessary to model neural activity at the interface of these systems, providing empirical support for an integrated mechanism where top-down semantic knowledge contextually modulates visual input. 
Furthermore, we characterize the internal structure of this semantic scaffold, revealing a unified atlas organized along a dominant animate-inanimate axis with robust left-hemisphere lateralization. Our study repositions language-derived knowledge from a secondary consequence to a primary cognitive scaffold, advancing an integrated mechanistic understanding of how the human brain constructs a coherent perception of the world.\u003c/p\u003e \u003cdiv id=\"ASec1\" class=\"AbstractSection\"\u003e \u003cdiv class=\"Heading\"\u003eGraphic Abstract\u003c/div\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e","manuscriptTitle":"The Semantic Scaffold: Functional Dissociation of Visual and Language-derived Features Shapes Human Natural Scene Understanding","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-12 08:23:31","doi":"10.21203/rs.3.rs-8259624/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"communications-biology","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"commsbio","sideBox":"Learn more about [Communications Biology](http://www.nature.com/commsbio/)","snPcode":"","submissionUrl":"","title":"Communications Biology","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Communications Series","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"597d6a3b-8660-4e28-8a6c-3c6126537f3b","owner":[],"postedDate":"January 12th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":60692415,"name":"Biological sciences/Neuroscience/Computational neuroscience/Neural encoding"},{"id":60692416,"name":"Biological sciences/Neuroscience/Cognitive neuroscience/Perception"},{"id":60692417,"name":"Biological sciences/Neuroscience/Cognitive neuroscience/Language"}],"tags":[],"updatedAt":"2026-02-14T16:55:13+00:00","versionOfRecord":[],"versionCreatedAt":"2026-01-12 08:23:31","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8259624","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8259624","identity":"rs-8259624","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}