Decoding the Moving Mind: Multi-Subject fMRI-to-Video Retrieval with MLLM Semantic Grounding

doi:10.1101/2025.04.07.647335

2025 · doi:10.1101/2025.04.07.647335

preprint OA: closed


Abstract

Decoding dynamic visual information from brain activity remains challenging due to inter-subject neural heterogeneity, limited per-subject data availability, and the substantial temporal resolution gap between fMRI signals (0.5 Hz) and video dynamics (30 Hz). Current approaches struggle to address these temporal mismatches, have limited capacity to integrate subject-specific neural patterns with shared representational frameworks, and lack adequate semantic granularity for aligning neural responses with visual content. To bridge these gaps, we propose a framework built on three innovations: (1) a Dynamic Temporal Alignment module that resolves temporal mismatches via exponentially weighted multi-frame fusion with adaptive decay coefficients; (2) a Brain Mixture-of-Experts architecture that combines subject-specific extractors with shared expert layers through parameter-efficient tri-modal contrastive learning; and (3) a Multi-perspective Semantic Hyper-Anchoring module that resolves cross-subject attention bias via multi-dimensional semantic decomposition, leveraging multimodal LLMs for fine-grained video semantic extraction. This enables the model to match individual attention patterns, as different subjects naturally focus on distinct aspects of the same visual stimulus. The module boosts Top-10/Top-100 retrieval by 17.7%/6.6%. Experiments on two video-fMRI datasets demonstrate state-of-the-art performance, with 39%/30% improvements in Top-10/Top-100 accuracy over single-subject baselines and 27% gains against multi-subject models. The framework exhibits remarkable few-shot adaptability, retaining 97% of its performance when using only 10% of the training data for new subjects. Visualization analysis confirms that this generalization capability stems from effective disentanglement of subject-specific and shared neural representations.

Competing Interest Statement

The authors have declared no competing interest.
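The abstract's exponentially weighted multi-frame fusion can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to weight the first frame most heavily, and the use of a fixed `decay` value (the abstract describes adaptive decay coefficients, which are not reproduced here) are all assumptions made for illustration. The only figures taken from the abstract are the 0.5 Hz and 30 Hz rates, which imply 60 video frames per fMRI sample.

```python
import math
from typing import List, Sequence

# From the abstract: 30 Hz video and 0.5 Hz fMRI, so 60 frames per fMRI sample.
FRAMES_PER_SAMPLE = int(30 / 0.5)


def decay_weights(num_frames: int, decay: float) -> List[float]:
    """Return normalized weights proportional to exp(-decay * i).

    Assumption: frame 0 receives the largest weight. The paper's actual
    ordering and its adaptive (possibly learned) decay are not specified
    in the abstract.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    raw = [math.exp(-decay * i) for i in range(num_frames)]
    total = sum(raw)
    return [w / total for w in raw]


def fuse_frames(frames: Sequence[Sequence[float]], decay: float) -> List[float]:
    """Fuse per-frame feature vectors into one vector by weighted averaging."""
    weights = decay_weights(len(frames), decay)
    dim = len(frames[0])
    fused = [0.0] * dim
    for weight, frame in zip(weights, frames):
        if len(frame) != dim:
            raise ValueError("all frames must have the same feature dimension")
        for j, value in enumerate(frame):
            fused[j] += weight * value
    return fused
```

With `decay = 0.0` the fusion reduces to a plain average over the frames; larger values concentrate the weight on the earliest frames in the window.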


Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00