DS-BEV: An Efficient Multi-Modal Fusion in Object Detection with Unified Bird's-Eye View Representation

Minghui Hu

Research Article. Research Square preprint, version 1. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4477033/v1
This work is licensed under a CC BY 4.0 License.

Abstract

Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recently, BEVFusion has been proposed to integrate LiDAR and image features in a unified Bird's Eye View (BEV) representation. However, local image information is lost while the backbone network extracts global image information. To fully integrate local features with global features, this paper proposes DS-BEV, a network based on feature selection and refinement.
It includes a Feature Selection Fusion module (FSM) and a Feature Refinement module (FRM). In the FSM, the features of each modality are first extracted by modality-specific networks and projected into a unified BEV representation space. Through channel and spatial learning, important information is selected from the initial features and fused to generate preliminary fusion features. Then, the image features extracted by a CNN and the preliminary fusion features output by the FSM are sent to the FRM together, where the local features generated by the CNN are used to refine the fusion features. We evaluate our model on the nuScenes dataset. Experiments show that DS-BEV achieves 69.5% mAP and 72.3% NDS in detection accuracy.

Keywords: unified representation, feature refinement, feature selection

Figures

Figure 1: The overall structure of the network.
Figure 2: The overall structure of the FSM module. The FSM module mainly consists of three parts: spatial attention module, channel attention module and feature selection module.
Figure 3: The overall structure of the DS-BEV Decoder. Image features from the CNN and the fused features are given, and the weight of related points is calculated by deformable attention. Self-attention is added after each deformable attention to supplement the information of the inside query.

1 Introduction

The autonomous driving system is equipped with different sensors. For example, Waymo's[1] self-driving car has 29 cameras, 6 radars and 5 lidars. Data from different sensors are represented in completely different ways: for example, cameras capture data in a perspective view, while LiDAR captures data in a 3D view. The sensor fusion strategy shows significant advantages in achieving stronger sensing capabilities.

Previous methods[2–6] use other features to enhance the LiDAR feature. Because the LiDAR feature has rich spatial information, the image is projected onto the point cloud feature. This approach requires accurate conversion between views and easily introduces noise, which makes detection more difficult. To simplify fusion and suppress this noise, researchers have proposed candidate box-based fusion[7], which first generates candidate boxes on different modalities and then performs joint detection on the candidate boxes. Although this method avoids complex point-level fusion, its accuracy gain is limited.
Recently, researchers have introduced a new fusion method that projects multiple features into a unified representation space[8–11]. This unified representation maintains both geometric structure and semantic density, which provides a new fusion idea for researchers. In the unified BEV feature space, BEV features can be processed with methods designed for two-dimensional images.

In this paper, we propose DS-BEV to further refine the fusion feature. Building on BEVFusion[10], we first enrich the channel and spatial information through our FSM to generate a preliminary fusion feature. Guided by this selection over the original features, we further enhance the channel features and spatial features. Image features are then extracted by a CNN, and these image features and the preliminary BEV feature are sent to the FRM. To reduce the computational cost, we use deformable attention instead of self-attention for this step; [12] reports that the computational complexity of deformable attention is lower than that of self-attention. Because deformable attention in the FRM only attends to the outside (image) features, we add a self-attention module after it to capture the internal relations of the fusion feature and finally generate high-quality fusion features. To evaluate the effectiveness of DS-BEV, we conducted experiments on the nuScenes dataset[13], where detection accuracy improves to 69.5% mAP and 72.3% NDS.

The main contributions of DS-BEV are:
- We propose DS-BEV, a new framework that performs well on the object detection task.
- We propose an FSM that selects channel and spatial information to enhance BEV features.
- We propose the FRM, a BEV refinement module.

2 Related Work

2.1 LiDAR-Camera Fusion

Multi-modal fusion is very significant in 3D detection tasks. Therefore, many researchers focus on how to better combine point clouds (geometric information) and images (semantic information).
Existing methods mainly operate at the candidate box level[14–19], the point level[20–25] and the feature level[8, 9, 10, 11, 26, 27]. MV3D[14] creates 3D proposals and projects them onto the image. F-PointNet[15], F-ConvNet[16] and TransFusion[17] create 2D proposals and project them into 3D. Point-level fusion methods, on the other hand, usually paint image semantic features onto foreground LiDAR points and perform LiDAR-based detection on the decorated point cloud inputs. Feature-level fusion applies different fusion methods at the feature stage to enhance interaction between modalities. In our model, because fusion happens in a unified BEV space, the fused features can readily be processed with 2D methods.

2.2 Deformable Attention

In the traditional transformer, attention is usually computed globally, so the amount of computation becomes very large when the feature map is large. Drawing on deformable convolution, deformable attention was designed in Deformable DETR[12] and DAT[31]. In Deformable DETR[12], each query attends only to the values at a small set of sampled points, which greatly reduces the amount of computation; the attention weights are generated directly from the query and then applied to the sampled points. In DAT[31], the features of the sampled points are combined with the bias information generated by the offset network to produce the keys and values. In this paper, we use the deformable attention of Deformable DETR[12].

2.3 BEV Feature-based

Bird's Eye View (BEV) is a perspective that views objects or scenes from above, like a bird looking down at the ground. In autonomous driving and robotics, data obtained by sensors (such as LiDAR and cameras) are usually converted into BEV representations to better perform tasks such as object detection. BEV simplifies a complex three-dimensional environment into a two-dimensional image, which is particularly important for efficient computation in real-time systems.
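To make the "3D environment to 2D image" step concrete, the following is a minimal sketch of rasterizing LiDAR points onto a BEV grid. It is not the paper's encoder: the ±54 m extent and the 0.6 m cell size are assumptions chosen only so that the grid comes out at the 180×180 BEV size stated in Section 4.1.1, and the per-cell point count stands in for a learned feature.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, ego frame


def points_to_bev_counts(
    points: List[Point],
    x_range: Tuple[float, float] = (-54.0, 54.0),  # ASSUMED extent, not from the paper
    y_range: Tuple[float, float] = (-54.0, 54.0),  # ASSUMED extent, not from the paper
    cell: float = 0.6,                             # ASSUMED cell size in metres
) -> List[List[int]]:
    """Count points per BEV cell; the height coordinate is simply dropped."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    grid = [[0] * nx for _ in range(ny)]
    for x, y, _z in points:
        inside_x = x_range[0] <= x < x_range[1]
        inside_y = y_range[0] <= y < y_range[1]
        if not (inside_x and inside_y):
            continue  # points outside the BEV extent are ignored
        # Clamp guards against floating-point rounding at the upper border.
        col = min(int((x - x_range[0]) / cell), nx - 1)
        row = min(int((y - y_range[0]) / cell), ny - 1)
        grid[row][col] += 1
    return grid
```

Once points live on such a grid, ordinary 2D operations (convolution, attention over cells) apply directly, which is the property the BEV methods discussed here rely on.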
BEVFormer[32] is a network structure based on the Transformer[34], which applies the deformable attention mechanism to feature extraction in BEV. Compared with traditional CNNs, BEVFormer can better capture long-distance dependencies. BEVDet[35] rotates, crops and scales the original image, and must apply the corresponding inverse transformation to the camera intrinsic and extrinsic parameters. BEVFusion[10] is a multi-sensor fusion method that fuses data from different sensors (such as LiDAR and cameras) into a unified BEV representation, combining the advantages of multiple sensors to achieve better performance in object detection and tracking tasks. BEVDepth[36] is a depth estimation method based on deep learning. It projects the point cloud data to the BEV and then uses a CNN to predict the depth of each pixel. BEVDepth[36] can produce high-quality depth maps, which is very useful for tasks such as navigation and object detection.

3 Method

DS-BEV focuses on improving modal fusion; the overall framework is shown in Fig. 1. Given the different inputs, we first apply modality-specific encoders to extract their features separately and convert them to a unified BEV representation space, and then generate preliminary BEV fusion features through the FSM. Using the original image features from the CNN, the FRM then produces the updated BEV fusion features.

3.1 Encoder

Given the inputs, modality-specific backbone networks are used to extract features. Following the state-of-the-art perception method BEVFusion, we construct our multi-modal feature encoder, which takes paired multi-view images and LiDAR point clouds as inputs and converts camera features into BEV space with depth prediction and geometric projection. In addition, we add a CNN that extracts local image information through convolution.
Specifically, the image input is converted into N 2D image features, which are lifted into 3D space to generate 3D voxel features and finally compressed into BEV features. The LiDAR points are encoded as 3D voxel features and mapped to BEV features.

3.2 FSM

The FSM mainly consists of three parts: a spatial attention module, a channel attention module and a feature selection module.

Attention module: After conversion to the unified BEV space, the image features and LiDAR features are spatially aligned in the BEV space. We resample the feature maps to the same scale and use spatial attention to compensate for the information missing from any single modality; channel attention is used in the same way. Then, we apply the channel attention weights and spatial attention weights obtained from the fused features to the BEV features before fusion, and integrate the resulting attention features into the fusion features through attention selection. The channel attention module helps the network focus on important channel features by learning the correlation between channels, thereby improving the representation ability of features and network performance. The spatial attention module helps the network focus on locally important areas and improves its understanding of spatial structure by learning the importance of each spatial location.
$$F_{\text{fuse}} = \text{Concat}(F_{\text{CB}}, F_{\text{LB}}) \tag{1}$$

$$F_{\text{fuse}}' = F_{\text{fuse}} + F_{\text{fuse}} \otimes M_{c}(F_{\text{fuse}}) \tag{2}$$

$$F_{\text{fuse}}'' = F_{\text{fuse}}' + F_{\text{fuse}}' \otimes M_{s}(F_{\text{fuse}}') \tag{3}$$

Here \(F_{\text{CB}}\) and \(F_{\text{LB}}\) are the camera and LiDAR BEV features, \(M_{c}\) is the channel attention weight, and \(M_{s}\) is the spatial attention weight.

Feature selection module: The previous step yields the channel-attention and spatial-attention features alongside the original features. We propose a feature selection module that assigns a weight to each feature. In this paper, each feature is assigned a different weight.

$$F_{\text{CB}}' = F_{\text{CB}} + F_{\text{CB}} \otimes M_{c}(F_{\text{fuse}}) \tag{4}$$

$$F_{\text{CB}}'' = F_{\text{CB}}' + F_{\text{CB}}' \otimes M_{s}(F_{\text{fuse}}') \tag{5}$$

$$F_{\text{Fuse}} = \phi_{1} F_{\text{CB}}'' + \phi_{2} F_{\text{LB}}'' + \phi_{3} F_{\text{fuse}}'' \tag{6}$$

Here \(\phi_{1}, \phi_{2}, \phi_{3}\) are the weighting coefficients, and \(F_{\text{LB}}''\) is obtained from \(F_{\text{LB}}\) in the same way as in Eqs. 4 and 5. After the three features are fused, the fused BEV feature map is obtained.
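The data flow of Eqs. 1 to 6 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the author's implementation: `channel_weights` and `spatial_weights` are parameter-free stand-ins for the learned \(M_{c}\) and \(M_{s}\); the concatenation of Eq. 1 is assumed to be reduced back to C channels (here by averaging) so that all three terms of Eq. 6 have the same shape; and the default `phi` values are placeholders, since the paper does not report its coefficients.

```python
import math
from typing import List, Sequence

Feature = List[List[List[float]]]  # indexed as [channel][row][col] on the BEV grid


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def channel_weights(f: Feature) -> List[float]:
    """Stand-in for M_c: sigmoid of each channel's global average (no learned layers)."""
    return [sigmoid(sum(map(sum, ch)) / (len(ch) * len(ch[0]))) for ch in f]


def spatial_weights(f: Feature) -> List[List[float]]:
    """Stand-in for M_s: sigmoid of the cross-channel mean at each BEV cell."""
    rows, cols = len(f[0]), len(f[0][0])
    return [[sigmoid(sum(ch[r][c] for ch in f) / len(f)) for c in range(cols)]
            for r in range(rows)]


def apply_channel(f: Feature, m_c: Sequence[float]) -> Feature:
    """F + F (x) M_c, the residual form used in Eqs. 2 and 4."""
    return [[[v * (1.0 + m) for v in row] for row in ch] for ch, m in zip(f, m_c)]


def apply_spatial(f: Feature, m_s: Sequence[Sequence[float]]) -> Feature:
    """F + F (x) M_s, the residual form used in Eqs. 3 and 5."""
    return [[[v * (1.0 + m) for v, m in zip(row, m_row)] for row, m_row in zip(ch, m_s)]
            for ch in f]


def fsm(f_cb: Feature, f_lb: Feature, phi=(0.3, 0.3, 0.4)) -> Feature:
    # Eq. 1 concatenates the two maps. ASSUMPTION: the result is reduced back to
    # C channels (here by averaging) so every term of Eq. 6 has the same shape.
    f_fuse = [[[0.5 * (a + b) for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
              for ca, cb in zip(f_cb, f_lb)]
    m_c = channel_weights(f_fuse)
    f_fuse_1 = apply_channel(f_fuse, m_c)                    # Eq. 2
    m_s = spatial_weights(f_fuse_1)
    f_fuse_2 = apply_spatial(f_fuse_1, m_s)                  # Eq. 3
    f_cb_2 = apply_spatial(apply_channel(f_cb, m_c), m_s)    # Eqs. 4-5
    f_lb_2 = apply_spatial(apply_channel(f_lb, m_c), m_s)    # same steps for LiDAR
    # Eq. 6: weighted selection; the phi values are placeholders, not the paper's.
    return [[[phi[0] * a + phi[1] * b + phi[2] * c for a, b, c in zip(ra, rb, rc)]
             for ra, rb, rc in zip(ca, cb, cc)]
            for ca, cb, cc in zip(f_cb_2, f_lb_2, f_fuse_2)]
```

Note that the attention weights are computed once from the fused map and then reused on each single-modality map, which is what lets the fused context reweight the camera and LiDAR branches before the final weighted sum.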
3.3 DS-BEV Decoder

After obtaining the preliminary fused BEV feature map, we further refine the fused features through the FRM. The FRM consists of two parts: a deformable attention module followed by a self-attention module.

In BEVFormer[32], image features and BEV queries are input into spatial cross-attention. Inspired by BEVFormer[32] and DAT, we design a deformable attention layer for our network. Specifically, we find the reference points of each BEV query on the image features, and the BEV query generates the attention weights and the offset (bias) information. The reference points are shifted by these offsets to obtain the learned sampling points, whose features serve as the values. Finally, the attention weights are applied to these values to obtain the image features relevant to the BEV query.

Because the images and the point cloud observe the same 3D scene, we can use the camera parameters to map points in 3D space onto the images. We first assume a set of candidate heights for each query on the BEV plane, and then project the resulting 3D points onto the 2D views. For a given BEV query, the projected 2D points fall on only some of the views; the other views are not hit. We call the set of hit views \(V_{\text{hit}}\). We then treat these 2D points as the reference points of the query and sample features from the hit views around them. Finally, we use the weighted sum of the sampled features as the output of deformable cross-attention.
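The lifting, projection and hit-view steps described above can be sketched as follows, together with the averaging over hit views that the paper formalizes in Eq. 7. Everything about the camera rig here is hypothetical (six pinhole cameras at the ego origin spaced 60° apart, a 350-pixel focal length, a 1.5 m mounting height, three pillar heights), and the `attend` callback is only a placeholder for the learned deformable attention with its offsets and weights.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pixel = Tuple[float, float]


def project(point: Tuple[float, float, float], yaw: float, focal: float = 350.0,
            width: int = 704, height: int = 256, cam_height: float = 1.5) -> Optional[Pixel]:
    """Pinhole projection into a camera at the ego origin rotated by `yaw` (hypothetical rig)."""
    x, y, z = point
    depth = x * math.cos(yaw) + y * math.sin(yaw)   # distance along the optical axis
    if depth <= 0.1:
        return None                                 # behind (or too close to) the camera
    right = x * math.sin(yaw) - y * math.cos(yaw)   # offset to the right of the axis
    u = width / 2 + focal * right / depth
    v = height / 2 + focal * (cam_height - z) / depth
    return (u, v) if 0 <= u < width and 0 <= v < height else None


def reference_points(bev_xy: Tuple[float, float], heights: Sequence[float],
                     yaws: Sequence[float]) -> Dict[int, List[Pixel]]:
    """Lift one BEV query to a pillar of 3D points and keep only the views it hits."""
    hits: Dict[int, List[Pixel]] = {}
    for i, yaw in enumerate(yaws):
        pixels = [project((bev_xy[0], bev_xy[1], h), yaw) for h in heights]
        pixels = [p for p in pixels if p is not None]
        if pixels:
            hits[i] = pixels
    return hits


def deformable_cross_attention(hits: Dict[int, List[Pixel]],
                               attend: Callable[[int, Pixel], float]) -> float:
    """Sum attend() over all reference points, then average over the hit views."""
    if not hits:
        return 0.0
    total = sum(attend(i, p) for i, pixels in hits.items() for p in pixels)
    return total / len(hits)


YAWS = [math.radians(a) for a in (0, 60, 120, 180, -120, -60)]  # ASSUMED six-camera ring
HEIGHTS = [0.0, 1.5, 3.0]                                       # ASSUMED pillar heights (m)
```

With this rig, a query straight ahead of the vehicle hits only the front camera, while a query 30° to the left falls in the overlap of two neighbouring cameras; normalizing by the number of hit views keeps the output scale comparable between the two cases.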
$$\text{DCA}(Q_{p}, F_{t}) = \frac{1}{|V_{\text{hit}}|} \sum_{i \in V_{\text{hit}}} \sum_{j=1}^{N_{\text{ref}}} \text{DeformAttn}(Q_{p}, \rho(p, i, j), F_{t}^{i}) \tag{7}$$

Here \(V_{\text{hit}}\) is the set of hit views, \(N_{\text{ref}}\) is the number of reference points per query, \(\rho(p, i, j)\) is the 3D-to-2D projection that gives the j-th reference point of query p on the i-th view, and \(F_{t}^{i}\) is the image feature of the i-th view.

In MetaBEV[8], a self-attention layer is added after the cross-attention layer, and its effectiveness is verified by experiments. We likewise place the self-attention layer after the deformable attention layer, which differs from the traditional design. The DS-BEV Decoder can therefore capture relations with the external (image) features and relations among the BEV queries at the same time.

4 Experiments

In this section, we describe the experimental settings in detail and present 3D detection results to validate the effectiveness and flexibility of DS-BEV.

4.1 Implementation Details

4.1.1 Network architecture

Our network is based on the BEVFusion architecture, with Swin-T[37] and VoxelNet[38] used as feature encoders for the cameras and LiDAR, respectively. The additional image network is ResNet-50[39]. In the FSM, we resize the three levels of features to the same size of 180×180. In the DS-BEV Decoder, we use one deformable attention layer and one self-attention layer to generate the fused BEV feature. In the deformable attention, we use 4 related points, which perform well (see Table 4).

Table 1 Comparisons with SOTA methods on the nuScenes val set. * is reported from [10].
| Method | Modality | Resolution | mAP | NDS |
| --- | --- | --- | --- | --- |
| BEVDepth-R50 [36] | C | 256×704 | 35.1 | 47.5 |
| BEVFormer [32] | C | 900×1600 | 41.6 | 51.7 |
| CenterPoint [40] | L | - | 59.6 | 66.8 |
| TransFusion-L [19] | L | - | 65.5 | 70.2 |
| FUTR3D [18] | C + L | - | 64.5 | 68.3 |
| UVTR [41] | C + L | - | 65.4 | 70.2 |
| PointPainting [20] | C + L | - | 65.8 | 69.6 |
| MVP* [42] | C + L | - | 66.1 | 70.0 |
| AutoAlign [24] | C + L | - | 66.6 | 71.1 |
| PointAugmenting [21] | C + L | - | 66.8 | 71.0 |
| TransFusion [43] | C + L | 448×800 | 67.5 | 71.3 |
| BEVFusion [9] | C + L | 448×800 | 69.6 | 72.1 |
| DeepInteraction [44] | C + L | 640×1600 | 69.9 | 72.6 |
| MetaBEV [8] | C + L | - | 68.0 | 71.5 |
| BEVFusion [10] | C + L | 256×704 | 68.5 | 71.4 |
| EA-BEV [11] | C + L | 256×704 | 69.4 | 71.8 |
| DS-BEV (Ours) | C + L | 256×704 | 69.5 | 72.3 |

4.1.2 Data sets and evaluation indicators

We evaluated DS-BEV on nuScenes[13], a large-scale multi-modal dataset for 3D detection. The dataset is divided into 700 / 150 / 150 scenes for training / validation / testing, and includes 40k labeled samples and 23 different classes. It contains data from multiple sensors, including six cameras, one LiDAR and five radars. For camera input, each frame consists of six views of the surrounding environment around a specific timestamp. We resize the input views to 256 × 704 resolution and voxelize the point cloud at 0.075 m and 0.1 m for detection. Our evaluation metrics are consistent with [13]: for 3D detection, we report the mean average precision (mAP) over the 10 foreground classes and the nuScenes detection score (NDS).

4.2 Comparison Results

Table 1 reports results on the nuScenes 3D object detection validation set. Compared with the BEVFusion[10] baseline, mAP and NDS increase by 1.0 and 0.9 points, respectively, and DS-BEV comes close to BEVFusion[9], which uses a larger 448×800 input. Table 2 reports results on the nuScenes 3D object detection test set, again as mAP and NDS over the 10 foreground classes.
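As a reading aid for the NDS columns: nuScenes defines the score as \(\text{NDS} = \frac{1}{10}\left[5\,\text{mAP} + \sum_{\text{mTP}} (1 - \min(1, \text{mTP}))\right]\), where the sum runs over the five mean true-positive errors (mATE, mASE, mAOE, mAVE, mAAE). The short sketch below is not part of the DS-BEV code; it only recomputes NDS from the other columns, using the DS-BEV and BEVFusion rows of Table 2 as inputs.

```python
from typing import Sequence


def nds(m_ap: float, tp_errors: Sequence[float]) -> float:
    """nuScenes detection score from mAP (in [0, 1]) and the five mean TP errors."""
    if len(tp_errors) != 5:
        raise ValueError("expected mATE, mASE, mAOE, mAVE, mAAE")
    # Each error is clipped at 1 and turned into a score where higher is better.
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Rows taken from Table 2 (mAP given there in percent).
ds_bev = nds(0.715, [0.251, 0.238, 0.345, 0.255, 0.122])
bevfusion = nds(0.713, [0.250, 0.240, 0.359, 0.254, 0.132])
print(round(100 * ds_bev, 1), round(100 * bevfusion, 1))
```

Because mAP carries half of the total weight, a gain in NDS can come either from better detection precision or from lower localization, scale, orientation, velocity and attribute errors on the matched boxes.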
Compared with the BEVFusion[10] entry in Table 2, mAP and NDS increase by 0.2 and 0.3 points, respectively.

Table 2 Comparisons with SOTA methods on the nuScenes test set.

| Method | Modality | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVDet [35] | C | 42.2 | 48.2 | 0.529 | 0.236 | 0.396 | 0.979 | 0.152 |
| BEVFormer [32] | C | 44.5 | 53.5 | 0.582 | 0.256 | 0.375 | 0.378 | 1.123 |
| PointPillars [46] | L | 30.5 | 45.3 | 0.517 | 0.290 | 0.500 | 0.316 | 0.368 |
| SECOND [47] | L | 52.8 | 63.3 | - | - | - | - | - |
| CenterPoint [40] | L | 60.3 | 67.3 | 0.262 | 0.239 | 0.361 | 0.288 | 0.136 |
| PointAugmenting [21] | C + L | 66.8 | 71.0 | 0.253 | 0.235 | 0.354 | 0.266 | 0.123 |
| MVP [22] | C + L | 66.4 | 70.5 | 0.263 | 0.238 | 0.321 | 0.313 | 0.134 |
| TransFusion [43] | C + L | 68.9 | 71.3 | 0.259 | 0.243 | 0.359 | 0.288 | 0.127 |
| CMT [48] | C + L | 70.4 | 73.0 | 0.299 | 0.241 | 0.323 | 0.240 | 0.112 |
| DeepInteraction [44] | C + L | 70.8 | 73.4 | 0.257 | 0.240 | 0.325 | 0.245 | 0.128 |
| BEVFusion [10] | C + L | 71.3 | 73.3 | 0.250 | 0.240 | 0.359 | 0.254 | 0.132 |
| DS-BEV (Ours) | C + L | 71.5 | 73.6 | 0.251 | 0.238 | 0.345 | 0.255 | 0.122 |

4.3 Ablation Studies and Discussions

To test the effectiveness of each module, Table 3 reports the ablation study of DS-BEV. For the FSM, C denotes channel attention and S denotes spatial attention; for the FRM, D denotes the DCA module and S denotes the SA module. The mAP and NDS in the table show the effectiveness of each module. Table 4 varies the number of related points: 8 points give the highest mAP (69.1), while 4 points give the highest NDS (72.0).

Table 3 Ablation study of DS-BEV.

| FSM: C | FSM: S | FRM: D | FRM: S | mAP (%) | NDS (%) |
| --- | --- | --- | --- | --- | --- |
| ✘ | ✘ | ✘ | ✘ | 68.5 | 71.4 |
| ✔ | ✘ | ✘ | ✘ | 68.6 | 71.5 |
| ✔ | ✔ | ✘ | ✘ | 68.7 | 71.8 |
| ✔ | ✔ | ✔ | ✘ | 69.2 | 72.0 |
| ✔ | ✔ | ✔ | ✔ | 69.5 | 72.3 |

Table 4 Ablation study of the number of related points (R-P).

| R-P | mAP | NDS |
| --- | --- | --- |
| 2 | 68.7 | 71.9 |
| 4 | 68.8 | 72.0 |
| 8 | 69.1 | 71.9 |

5 Conclusions

In this paper, we propose a method that fuses camera features and LiDAR features in BEV space to improve detection accuracy.
By combining channel attention and spatial attention weights in a purpose-built attention fusion module to strengthen the expression of key features, and by combining DCA with self-attention to improve local and global coordination, our method shows advanced performance on the challenging nuScenes dataset. Follow-up work remains, in particular experiments on different backbone networks to verify that the model is robust and general.

Declarations

Author Contribution

All the work is done by H.

References

[1] Mei, J., Zhu, A., Yan, X., et al. "Waymo Open Dataset: Panoramic Video Panoptic Segmentation."
[2] Liang, M., Yang, B., Wang, S., et al. "Deep Continuous Fusion for Multi-Sensor 3D Object Detection." Computer Vision – ECCV 2018, Lecture Notes in Computer Science, 2018, pp. 663–678, https://doi.org/10.1007/978-3-030-01270-0_39.
[3] Liang, Ming, et al. "Multi-Task Multi-Sensor Fusion for 3D Object Detection." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, https://doi.org/10.1109/cvpr.2019.00752.
[4] Nabati, Ramin, and Hairong Qi. "CenterFusion: Center-Based Radar and Camera Fusion for 3D Object Detection." 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, https://doi.org/10.1109/wacv48630.2021.00157.
[5] Xie, Liang, et al. "PI-RCNN: An Efficient Multi-Sensor 3D Object Detector with Point-Based Attentive Cont-Conv Fusion Module." arXiv, Nov. 2019.
[6] Zhang, Haolin, et al. "Faraway-Frustum: Dealing with Lidar Sparsity for 3D Object Detection Using Fusion." 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, https://doi.org/10.1109/itsc48978.2021.9564990.
[7] Pang, Su, et al. "CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection." 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, https://doi.org/10.1109/iros45743.2020.9341791.
[8] Ge, Chongjian, et al. "MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation." Apr. 2023.
[9] Liang, Tingting, et al. "BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework."
[10] Liu, Zhijian, et al. "BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation."
[11] Haotian, Haotian, et al. "EA-BEV: Edge-Aware Bird's-Eye-View Projector for 3D Object Detection." Mar. 2023.
[12] Zhu, Xizhou, et al. "Deformable DETR: Deformable Transformers for End-to-End Object Detection." arXiv: Computer Vision and Pattern Recognition, Oct. 2020.
[13] Caesar, Holger, et al. "nuScenes: A Multimodal Dataset for Autonomous Driving." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, https://doi.org/10.1109/cvpr42600.2020.01164.
[14] Chen, Xiaozhi, et al. "Multi-View 3D Object Detection Network for Autonomous Driving." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, https://doi.org/10.1109/cvpr.2017.691.
[15] Qi, Charles R., et al. "Frustum PointNets for 3D Object Detection from RGB-D Data." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, https://doi.org/10.1109/cvpr.2018.00102.
[16] Wang, Zhixin, and Kui Jia. "Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection." 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, https://doi.org/10.1109/iros40897.2019.8968513.
[17] Nabati, Ramin, and Hairong Qi. "CenterFusion: Center-Based Radar and Camera Fusion for 3D Object Detection." 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, https://doi.org/10.1109/wacv48630.2021.00157.
[18] Chen, Xuanyao, et al. "FUTR3D: A Unified Sensor Fusion Framework for 3D Detection."
[19] Bai, Xuyang, et al. "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers."
[20] Vora, Sourabh, et al. "PointPainting: Sequential Fusion for 3D Object Detection." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, https://doi.org/10.1109/cvpr42600.2020.00466.
[21] Wang, Chunwei, et al. "PointAugmenting: Cross-Modal Augmentation for 3D Object Detection." 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, https://doi.org/10.1109/cvpr46437.2021.01162.
[22] Yin, Tianwei, et al. "Multimodal Virtual Point 3D Detection." arXiv, Dec. 2021.
[23] Xu, Shaoqing, et al. "FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection." 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, https://doi.org/10.1109/itsc48978.2021.9564951.
[24] Chen, Zehui, et al. "AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection." Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022, https://doi.org/10.24963/ijcai.2022/116.
[25] Chen, Yukang, et al. "Focal Sparse Convolutional Networks for 3D Object Detection." Apr. 2022.
[26] Liang, Ming, et al. "Deep Continuous Fusion for Multi-Sensor 3D Object Detection." Computer Vision – ECCV 2018, Lecture Notes in Computer Science, 2018, pp. 663–678, https://doi.org/10.1007/978-3-030-01270-0_39.
[27] Li, Yingwei, et al. "DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection."
[28] Haotian, Haotian, et al. "EA-BEV: Edge-Aware Bird's-Eye-View Projector for 3D Object Detection." Mar. 2023.
[29] Ge, Chongjian, et al. "MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation." Apr. 2023.
[30] Xie, Yichen, et al. "SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection."
[31] Xia, Zhuofan, et al. "Vision Transformer with Deformable Attention."
[32] Li, Zhiqi, et al. "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers." Computer Vision – ECCV 2022, Lecture Notes in Computer Science, 2022, pp. 1–18, https://doi.org/10.1007/978-3-031-20077-9_1.
[33] Zou, Jiayu, et al. "DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception." Mar. 2023.
[34] Vaswani, Ashish, et al. "Attention Is All You Need." Neural Information Processing Systems, June 2017.
[35] Huang, Junjie, et al. "BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View."
[36] Li, Yinhao, et al. "BEVDepth: Acquisition of Reliable Depth for Multi-View 3D Object Detection."
[37] Lin, Liting, et al. "SwinTrack: A Simple and Strong Baseline for Transformer Tracking." arXiv: Computer Vision and Pattern Recognition, Dec. 2021.
[38] Zhou, Yin, and Oncel Tuzel. "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, https://doi.org/10.1109/cvpr.2018.00472.
[39] He, Kaiming, et al. "Deep Residual Learning for Image Recognition." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, https://doi.org/10.1109/cvpr.2016.90.
[40] Yin, Tianwei, et al. "Center-Based 3D Object Detection and Tracking." 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, https://doi.org/10.1109/cvpr46437.2021.01161.
[41] Li, Yanwei, et al. "Unifying Voxel-Based Representation with Transformer for 3D Object Detection." June 2022.
[42] Pan, Liang, et al. "Multi-View Partial (MVP) Point Cloud Challenge 2021 on Completion and Registration: Methods and Results."
[43] Bai, Xuyang, et al. "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers."
[44] Yang, Zeyu, et al. "DeepInteraction: 3D Object Detection via Modality Interaction." Aug. 2022.
[45] Huang, Junjie, et al. "BEVDet4D: Exploit Temporal Cues in Multi-Camera 3D Object Detection."
[46] Lang, Alex H., et al. "PointPillars: Fast Encoders for Object Detection from Point Clouds." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, https://doi.org/10.1109/cvpr.2019.01298.
[47] Yan, Yan, et al. "SECOND: Sparsely Embedded Convolutional Detection." Sensors, Oct. 2018, p. 3337, https://doi.org/10.3390/s18103337.
[48] Yan, Junjie, et al. "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection." Jan. 2023.

Additional Declarations

No competing interests reported.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4477033","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":307923850,"identity":"6521980b-606c-4fe1-b0c2-3511d74e7e74","order_by":0,"name":"Minghui Hu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA1UlEQVRIie3RoQrCUBTG8SsXruXi6jcm8wmECwNxRV9lMuOCcRYdDLToAwg+hCCI8chgltlntJgM2g1WTbs2wfvvPzgfhzGT6QcTrYzIfoJb9ZT0SANiQH7SrduLPNAjLqRH4yS2VBkpzcOchaLzHg4rikd5Yz23nVSRZj46rAp4tfly669Z6HWoirBwk0Eg5PK0cySjwa6aBCp7CkxniK6aBENF9gxcIhKaROYBoQCHzD1/rTS2tOZpdkc84f1jeilvcc+tJB9Bar7mnXwrTCaT6S96AXOFQutd3J3PAAAAAElFTkSuQmCC","orcid":"","institution":"Inner Mongolia University","correspondingAuthor":true,"prefix":"","firstName":"Minghui","middleName":"","lastName":"Hu","suffix":""}],"badges":[],"createdAt":"2024-05-25 13:44:18","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4477033/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4477033/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":57427089,"identity":"42d6eb7d-4702-4a18-97ee-8a8db7f2016d","added_by":"auto","created_at":"2024-05-30 14:30:19","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":71687,"visible":true,"origin":"","legend":"\u003cp\u003eThe overall structure of the 
network\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-4477033/v1/5ae6a2a0f40759997a6f0d70.png"},{"id":57427088,"identity":"5b5fc3e7-1002-4dfb-a343-d4ee500475a7","added_by":"auto","created_at":"2024-05-30 14:30:19","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":58936,"visible":true,"origin":"","legend":"\u003cp\u003eThe overall structure of FSM module. The FSM module mainly consists of three parts : spatial attention module, channel attention module and feature selection module\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-4477033/v1/ebcee413b3cfd8e2a40c1ff9.png"},{"id":57427090,"identity":"5ea27d55-c448-4854-9863-d385955cc787","added_by":"auto","created_at":"2024-05-30 14:30:19","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":36082,"visible":true,"origin":"","legend":"\u003cp\u003eThe overall structure of DS-BEV Decoder.Image features from cnn network,and fused features are given, and the weight of related points is calculated by deformable-attention. 
Self-attention is added after each deformable attention layer to supplement the information inside the query.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-4477033/v1/c63c58db7318d9fd47a48aae.png"},{"id":57428255,"identity":"522d8daa-bffb-4e63-8d65-bcecc7edb809","added_by":"auto","created_at":"2024-05-30 14:38:20","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":665178,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4477033/v1/d3b36b1d-93c6-4aed-a2e9-10cd708ed22a.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"DS-BEV:An Efficient Multi-Modal Fusion in Object Detection with Unified Bird's-Eye View Representation","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eAutonomous driving systems are equipped with a variety of sensors. For example, Waymo's[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e] self-driving car has 29 cameras, 6 radars and 5 LiDARs. Data from different sensors are represented in fundamentally different ways: for example, cameras capture data in a perspective view, whereas LiDAR captures data in a 3D view. Sensor fusion shows significant advantages in achieving stronger perception capabilities.\u003c/p\u003e \u003cp\u003ePrevious methods[\u003cspan additionalcitationids=\"CR3 CR4 CR5\" citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] use features from other modalities to enhance the LiDAR features. Because LiDAR features carry rich spatial information, the image features are projected onto the point cloud features. This approach requires an accurate transformation between the two views and easily introduces noise, which makes detection more difficult. 
To simplify fusion and reduce this noise, researchers have proposed candidate box-based fusion[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], which first generates candidate boxes in each modality and then performs joint detection on those boxes. Although this method avoids complex point-level fusion, its accuracy gain is limited. Recently, researchers have introduced a new fusion approach that projects features from multiple modalities into a unified representation space[\u003cspan additionalcitationids=\"CR9 CR10\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. This unified representation preserves both geometric structure and semantic density and offers a new direction for fusion. In the unified BEV feature space, BEV features can be processed with methods developed for two-dimensional images.\u003c/p\u003e \u003cp\u003eIn this paper, we propose DS-BEV to further refine the fused features. Building on BEVFusion[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e], we first enrich the channel and spatial information through our FSM to generate preliminary fusion features. By selecting among the original features, we further enhance the channel and spatial features. Image features are then extracted by a CNN network, and these image features and the preliminary BEV features are sent to the FRM. To reduce the computational cost, we use deformable attention instead of self-attention. As noted in [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e], the computational complexity of deformable attention is lower than that of self-attention. Because the deformable attention in the FRM only attends outside the queries, we add a self-attention module after it to capture the relations inside the fused features. 
Finally, high-quality fusion features are generated.\u003c/p\u003e \u003cp\u003eTo evaluate the effectiveness of the proposed DS-BEV, we conducted experiments on the nuScenes dataset[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], where detection accuracy improved to 69.5% mAP and 72.3% NDS.\u003c/p\u003e \u003cp\u003eThe main contributions of this work are as follows: We propose a new framework, DS-BEV, which performs well on the object detection task. We propose an FSM that selects channel and spatial information to enhance BEV features. We propose an FRM, a BEV refinement module.\u003c/p\u003e"},{"header":"2 Related Work","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 LiDAR-Camera Fusion\u003c/h2\u003e \u003cp\u003eMulti-modal fusion is important in 3D detection tasks, so many researchers focus on how to better combine point clouds (geometric information) and images (semantic information). Existing methods mainly operate at the candidate box level[\u003cspan additionalcitationids=\"CR15 CR16 CR17 CR18\" citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], the point level[\u003cspan additionalcitationids=\"CR21 CR22 CR23 CR24\" citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e] and the feature level [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. 
MV3D[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] creates 3D proposals and projects them onto the image. F-PointNet[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], F-ConvNet[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] and TransFusion[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] create 2D proposals and project them into 3D. Point-level fusion methods, on the other hand, usually paint image semantic features onto foreground LiDAR points and perform LiDAR-based detection on the decorated point cloud inputs. Feature-level fusion applies different fusion methods at the feature stage to enhance the interaction between modalities. In our model, the features lie in a unified BEV space and are therefore easy to fuse with 2D methods.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Deformable Attention\u003c/h2\u003e \u003cp\u003eIn the standard Transformer, attention is usually computed globally, so the computational cost becomes very large when the feature map is large. Drawing on deformable convolution, Deformable DETR[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] and DAT[\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] design attention computations based on deformable attention. In Deformable DETR[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e], each query interacts only with the values at its related points, which greatly reduces the amount of computation. The attention weight matrix is generated directly from the query and then applied to the related points. In [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e], the features of the related points are combined with the bias information generated by the offset network to generate the keys and values. 
In this paper, we follow Deformable DETR[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] to compute attention.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 BEV Feature-based Methods\u003c/h2\u003e \u003cp\u003eBird's Eye View (BEV) is a perspective that views objects or scenes from above, like a bird looking down at the ground from the air. In autonomous driving and robotics, data obtained by sensors (such as LiDAR and cameras) are usually converted into BEV representations to better perform tasks such as object detection. BEV simplifies a complex three-dimensional environment into a two-dimensional image, which is particularly important for efficient computation in real-time systems. BEVFormer[\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] is a network based on the Transformer[\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e] that applies a deformable attention mechanism to feature extraction in BEV. Compared with traditional CNN networks, BEVFormer can better capture long-range dependencies. BEVDet[\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e] rotates, crops, and scales the original image, and therefore needs to multiply the intrinsic and extrinsic parameters of the camera by the corresponding inverse transformation. BEVFusion[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] is a multi-sensor fusion framework that fuses data from different sensors (such as LiDAR and cameras) into a unified BEV representation. By combining the advantages of multiple sensors, BEVFusion[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] achieves better performance in object detection and tracking tasks. BEVDepth[\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e] is a depth estimation algorithm based on deep learning. 
It projects the point cloud data to the BEV, and then uses a CNN network to predict the depth information of each pixel. BEVDepth[\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e] can produce high-quality depth maps, which are very useful for tasks such as navigation and object detection.\u003c/p\u003e \u003c/div\u003e"},{"header":"3 Method","content":"\u003cp\u003eDS-BEV focuses on improving modal fusion; the overall framework is shown in Fig. \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e. Given the different inputs, we first apply modality-specific encoders to extract their features separately and convert them to a unified BEV representation space, and then generate preliminary BEV fusion features through the FSM. Together with the original image features from the CNN network, these are passed through the FRM to obtain the updated BEV fusion features.\u003c/p\u003e\n\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\n \u003ch2\u003e3.1 Encoder\u003c/h2\u003e\n \u003cp\u003eFor each input, a modality-specific backbone network is used to extract features. Following the state-of-the-art perception method BEVFusion, we construct our multi-modal feature encoder, which takes pairs of multi-view images and LiDAR point clouds as inputs and converts the camera features into BEV space through depth prediction and geometric projection. In addition, we add a CNN network to extract local image information through convolution. Specifically, the image input is converted into N 2D image features, which are then lifted into 3D space to generate 3D voxel features and finally compressed into BEV features. 
The LiDAR features are projected into 3D voxel features and mapped to BEV features.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003e3.2 FSM\u003c/h2\u003e\n \u003cp\u003eThe FSM consists of three parts: a spatial attention module, a channel attention module and a feature selection module. Attention module: Once both modalities are represented in the unified BEV space, the image features and LiDAR features correspond position by position. We resample the feature maps to the same scale and use spatial attention to compensate for information that is missing from a single modality. Channel attention is used in the same way. Then, we apply the channel attention weights and spatial attention weights obtained from the fused features to the BEV features before fusion, and integrate the resulting attention features into the fusion features through feature selection. The channel attention module helps the network focus on important channel features by learning the correlation between channels, thereby improving the representation ability of the features and the network performance. 
The spatial attention module helps the network focus on locally important areas of the image and improves its understanding of the spatial structure of the image by learning the importance of each spatial location.\u003c/p\u003e\n \u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e$${\\text{F}}_{\\text{fuse}}=\\text{Concat}({\\text{F}}_{\\text{CB}},{\\text{F}}_{\\text{LB}})$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e$${\\text{F}}_{\\text{fuse}}^{\\prime}={\\text{F}}_{\\text{fuse}}+{\\text{F}}_{\\text{fuse}}\\otimes {\\text{M}}_{\\text{c}}\\left({\\text{F}}_{\\text{fuse}}\\right)$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e$${\\text{F}}_{\\text{fuse}}^{\\prime\\prime}={\\text{F}}_{\\text{fuse}}^{\\prime}+{\\text{F}}_{\\text{fuse}}^{\\prime}\\otimes {\\text{M}}_{\\text{s}}\\left({\\text{F}}_{\\text{fuse}}^{\\prime}\\right)$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u0026nbsp;\u003cspan class=\"mathinline\"\u003e\\({\\text{M}}_{\\text{c}}\\)\u003c/span\u003e\u0026nbsp;\u003c/span\u003e is the channel attention weight, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({ 
\\text{M}}_{\\text{s}}\\)\u003c/span\u003e\u003c/span\u003e is the spatial attention weight.\u003c/p\u003e\n \u003cp\u003eFeature selection module: The previous step yields the spatial attention and channel attention features together with the original features. We propose a feature selection module that assigns a weight to each of these features. In this paper, each feature receives a different weight.\u003c/p\u003e\n \u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e$${\\text{F}}_{\\text{CB}}^{\\prime}={\\text{F}}_{\\text{CB}}+{\\text{F}}_{\\text{CB}}\\otimes {\\text{M}}_{\\text{c}}\\left({\\text{F}}_{\\text{fuse}}\\right)$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e$${\\text{F}}_{\\text{CB}}^{\\prime\\prime}={\\text{F}}_{\\text{CB}}^{\\prime}+{\\text{F}}_{\\text{CB}}^{\\prime}\\otimes {\\text{M}}_{\\text{s}}\\left({\\text{F}}_{\\text{fuse}}^{\\prime}\\right)$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e$${\\text{F}}_{\\text{Fuse}}={\\phi }_{1}{\\text{F}}_{\\text{CB}}^{\\prime\\prime}+{\\phi }_{2}{\\text{F}}_{\\text{LB}}^{\\prime\\prime}+{\\phi }_{3}{\\text{F}}_{\\text{fuse}}^{\\prime\\prime}$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e6\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003eHere, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\phi }_{1},{\\phi }_{2},{\\phi 
}_{3}\\)\u003c/span\u003e\u003c/span\u003e are the weighting coefficients. After the three features are fused, the fused BEV feature map is obtained.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n \u003ch2\u003e3.3 DS-BEV Decoder\u003c/h2\u003e\n \u003cp\u003eAfter obtaining the preliminary fused BEV feature map, we further refine the fused features through the FRM. The FRM consists of two parts: a deformable attention module followed by a self-attention module. In BEVFormer[\u003cspan class=\"CitationRef\"\u003e32\u003c/span\u003e], image features and BEV queries are fed into spatial cross-attention. Inspired by BEVFormer[\u003cspan class=\"CitationRef\"\u003e32\u003c/span\u003e] and DAT, we design a deformable attention layer for our network. Specifically, we find the related points of each BEV query on the image features, and the BEV query then generates the attention matrix and the bias information. The related points are shifted according to the bias to obtain the learned points, whose features serve as the values. Finally, the attention matrix is applied to these values to obtain the image features relevant to the BEV query. To find the related points, we note that an image is a projection of the 3D scene from a particular viewing angle, so we can use the camera parameters to map points in 3D space onto the image. We first estimate the possible heights of each query on the BEV plane, and then project the resulting 3D points onto the 2D views. For a given BEV query, the projected 2D points fall on only some of the views, and the other views are not hit. We call the set of hit views Vhit. We then regard these 2D points as the reference points of the query and sample features from the hit views around these reference points. 
Finally, we use the weighted sum of the sampled features as the output of deformable cross-attention.\u003c/p\u003e\n \u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e$$\\text{DCA}({\\text{Q}}_{\\text{p}},{\\text{F}}_{\\text{t}})=\\frac{1}{\\left|{\\text{V}}_{\\text{hit}}\\right|}\\sum _{\\text{i}\\in {\\text{V}}_{\\text{hit}}}\\sum _{\\text{j}=1}^{{\\text{N}}_{\\text{ref}}}\\text{DeformAttn}({\\text{Q}}_{\\text{p}},{\\rho }(\\text{p},\\text{i},\\text{j}),{\\text{F}}_{\\text{t}}^{\\text{i}})$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e7\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003eHere, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\text{V}}_{\\text{hit}}\\)\u003c/span\u003e\u003c/span\u003e is the set of hit views, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\text{N}}_{\\text{ref}}\\)\u003c/span\u003e\u003c/span\u003e is the number of related points on the image features for each query, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\rho }(\\text{p},\\text{i},\\text{j})\\)\u003c/span\u003e\u003c/span\u003e is the 3D-to-2D projection that gives the j-th point of query p on the i-th view, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\text{F}}_{\\text{t}}^{\\text{i}}\\)\u003c/span\u003e\u003c/span\u003e is the image feature of the i-th view.\u003c/p\u003e\n \u003cp\u003eIn MetaBEV[\u003cspan class=\"CitationRef\"\u003e8\u003c/span\u003e], a self-attention layer is added after the cross-attention layer, and the effectiveness of this layer is verified by experiments. We likewise place the self-attention layer after the deformable attention layer, which differs from the traditional design. 
The DS-BEV Decoder can therefore capture relations outside the queries and relations inside the queries at the same time.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"4 Experiments","content":"\u003cp\u003eIn this section, we describe the experimental settings in detail and present the 3D detection results to validate the effectiveness and flexibility of DS-BEV.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Implementation Details\u003c/h2\u003e \u003cdiv id=\"Sec12\" class=\"Section3\"\u003e \u003ch2\u003e4.1.1 Network architecture\u003c/h2\u003e \u003cp\u003eOur network is based on the BEVFusion architecture, with Swin-T[\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e] and VoxelNet[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e] used as feature encoders for cameras and LiDAR, respectively. The additional image network is ResNet-50[\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]. In the FSM, we resize the three levels of features to the same size, 180\u0026times;180. In the DS-BEV Decoder, we use one deformable attention layer and one self-attention layer to generate the fused BEV features. 
In the deformable attention layer, we use 4 related points, which we found to perform well.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison with state-of-the-art methods on the nuScenes val set. * indicates results reported in [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMethod\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModality\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eResolution\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003emAP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNDS\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBEVDepth-R50 [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003e256\u0026times;704\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e35.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e47.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBEVFormer [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e900\u0026times;1600\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e41.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e51.7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCenterPoint [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e59.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e66.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTransFusion-L [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e65.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003e70.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFUTR3D [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e64.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e68.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUVTR [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e65.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e70.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePointPainting [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e65.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e69.6\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMVP* [\u003cspan 
citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e66.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e70.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAutoAlign [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e66.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e71.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePointAugmenting [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e66.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e71.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTransFusion [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e448\u0026times;800\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e67.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e71.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBEVFusion [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e448\u0026times;800\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e69.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e72.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepInteraction [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e640\u0026times;1600\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e69.9\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e72.6\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMetaBEV [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e68.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e71.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBEVFusion [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e256\u0026times;704\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e68.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e71.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEA-BEV [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e256\u0026times;704\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e69.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e71.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDS-BEV(Ours)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e256\u0026times;704\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e69.5\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e72.3\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e4.12 Datasets and Evaluation Metrics\u003c/h2\u003e \u003cp\u003eWe evaluate DS-BEV on nuScenes[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], a large-scale multi-modal dataset for 3D detection. The dataset is split into 700 / 150 / 150 scenes for training / validation / testing and includes 40k labeled samples and 23 object classes. It contains data from multiple sensors: six cameras, one LiDAR and five radars. For camera input, each frame consists of six views of the surrounding environment captured at a given timestamp. We resize the input views to 256 \u0026times; 704 and voxelize the point cloud with voxel sizes of 0.075 m and 0.1 m for detection. Our evaluation metrics follow [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]: for 3D detection, we report the mean average precision (mAP) over the 10 foreground classes and the nuScenes detection score (NDS).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Comparison Results\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e reports the results on the nuScenes 3D object detection validation set. 
Compared with the BEVFusion[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] baseline, DS-BEV improves mAP from 68.5% to 69.5% and NDS from 71.4% to 72.3%, gains of 1.0 and 0.9 points, respectively. DS-BEV is also close to BEVFusion[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] (69.6% mAP, 72.1% NDS), which uses a larger 448\u0026times;800 input.\u003c/p\u003e \u003cp\u003eIn Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, we report the results on the nuScenes 3D object detection test set, giving the mAP and NDS averaged over the 10 foreground classes. Relative to the BEVFusion[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] baseline, DS-BEV raises mAP from 71.3% to 71.5% and NDS from 73.3% to 73.6%.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison with SOTA methods on the nuScenes test set.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" 
colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMethod\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModality\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003emAP\u0026uarr;\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNDS\u0026uarr;\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003emATE\u0026darr;\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003emASE\u0026darr;\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003emAOE\u0026darr;\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003emAVE\u0026darr;\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003emAAE\u0026darr;\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBEVDet [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e42.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e48.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.529\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.236\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.396\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.979\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.152\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBEVFormer [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e44.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e53.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.582\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.256\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.375\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.378\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e1.123\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePointPillars [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e30.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e45.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.517\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.316\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.368\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSECOND [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e52.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e63.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCenterPoint [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e60.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e67.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.262\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.239\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.361\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.288\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.136\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePointAugmenting 
[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e66.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e71.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.253\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.235\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.354\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.266\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.123\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMVP [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e66.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e70.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.263\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.238\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.321\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.313\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.134\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eTransFusion [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e68.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e71.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.259\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.243\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.359\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.288\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.127\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCMT [\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e70.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e73.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.299\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.241\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.323\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.240\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.112\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eDeepInteraction [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e70.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e73.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.257\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.240\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.325\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.245\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.128\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBEVFusion [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e71.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e73.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.250\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.240\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.359\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.254\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.132\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDS-BEV(Ours)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC\u0026thinsp;+\u0026thinsp;L\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e71.5\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e73.6\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.251\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.238\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.345\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.255\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.122\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Ablation Studies and Discussions\u003c/h2\u003e \u003cp\u003eTo test the effectiveness of each module, Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e reports the ablation study of DS-BEV. Under FSM, C denotes channel attention and S denotes spatial attention. Under FRM, D denotes that we use the DCA module, and S denotes our SA module. 
The mAP and NDS in the table show the effectiveness of our modules: enabling all of them raises mAP from 68.5% to 69.5% and NDS from 71.4% to 72.3%. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e varies the number of relevant points. With 8 relevant points the model reaches its best mAP (69.1%), while NDS stays between 71.9% and 72.0% across all settings.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAblation study of DS-BEV.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eFSM\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eFRM\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003emAP(%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eNDS(%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eS\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003eD\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eS\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e68.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e71.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e68.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e71.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e68.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e71.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e✘\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e69.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e72.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e✔\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e69.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e72.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAblation study of the number of relevant points.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eR-P\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003emAP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNDS\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e68.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e71.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e68.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e72.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e69.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e71.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5 Conclusions","content":"\u003cp\u003eIn this paper, we propose a method for fusing camera features and LiDAR features in BEV space to improve detection accuracy. A specially designed attention fusion module combines the weights of channel attention and spatial attention to enhance the expression of key features, and DCA is combined with self-attention to improve the coordination between local and global features. With these components, our method achieves competitive performance on the challenging nuScenes dataset. Our paper still has follow-up tasks. 
Experiments on different backbone networks indicate that our model is robust and general.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAll the work was done by H.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eMei, J., et al. Waymo Open Dataset: Panoramic Video Panoptic Segmentation.\u003c/li\u003e\n\u003cli\u003eLiang, Ming, et al. \u0026ldquo;Deep Continuous Fusion for Multi-Sensor 3D Object Detection.\u0026rdquo; Computer Vision \u0026ndash; ECCV 2018, Lecture Notes in Computer Science, 2018, pp. 663\u0026ndash;78, https://doi.org/10.1007/978-3-030-01270-0_39.\u003c/li\u003e\n\u003cli\u003eLiang, Ming, et al. \u0026ldquo;Multi-Task Multi-Sensor Fusion for 3D Object Detection.\u0026rdquo; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, https://doi.org/10.1109/cvpr.2019.00752.\u003c/li\u003e\n\u003cli\u003eNabati, Ramin, and Hairong Qi. \u0026ldquo;CenterFusion: Center-Based Radar and Camera Fusion for 3D Object Detection.\u0026rdquo; 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, https://doi.org/10.1109/wacv48630.2021.00157.\u003c/li\u003e\n\u003cli\u003eXie, Liang, et al. \u0026ldquo;PI-RCNN: An Efficient Multi-Sensor 3D Object Detector with Point-Based Attentive Cont-Conv Fusion Module.\u0026rdquo; arXiv, Nov. 2019.\u003c/li\u003e\n\u003cli\u003eZhang, Haolin, et al. \u0026ldquo;Faraway-Frustum: Dealing with Lidar Sparsity for 3D Object Detection Using Fusion.\u0026rdquo; 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, https://doi.org/10.1109/itsc48978.2021.9564990.\u003c/li\u003e\n\u003cli\u003ePang, Su, et al. 
\u0026ldquo;CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection.\u0026rdquo; 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, https://doi.org/10.1109/iros45743.2020.9341791.\u003c/li\u003e\n\u003cli\u003eGe, Chongjian, et al. MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation. Apr. 2023.\u003c/li\u003e\n\u003cli\u003eLiang, Tingting, et al. BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework.\u003c/li\u003e\n\u003cli\u003eLiu, Zhijian, et al. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird\u0026rsquo;s-Eye View Representation.\u003c/li\u003e\n\u003cli\u003eHaotian, Haotian, et al. EA-BEV: Edge-Aware Bird\u0026rsquo;s-Eye-View Projector for 3D Object Detection. Mar. 2023.\u003c/li\u003e\n\u003cli\u003eZhu, Xizhou, et al. \u0026ldquo;Deformable DETR: Deformable Transformers for End-to-End Object Detection.\u0026rdquo; arXiv, Oct. 2020.\u003c/li\u003e\n\u003cli\u003eCaesar, Holger, et al. \u0026ldquo;nuScenes: A Multimodal Dataset for Autonomous Driving.\u0026rdquo; 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, https://doi.org/10.1109/cvpr42600.2020.01164.\u003c/li\u003e\n\u003cli\u003eChen, Xiaozhi, et al. \u0026ldquo;Multi-View 3D Object Detection Network for Autonomous Driving.\u0026rdquo; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, https://doi.org/10.1109/cvpr.2017.691.\u003c/li\u003e\n\u003cli\u003eQi, Charles R., et al. \u0026ldquo;Frustum PointNets for 3D Object Detection from RGB-D Data.\u0026rdquo; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, https://doi.org/10.1109/cvpr.2018.00102.\u003c/li\u003e\n\u003cli\u003eWang, Zhixin, and Kui Jia. 
\u0026ldquo;Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection.\u0026rdquo; 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, https://doi.org/10.1109/iros40897.2019.8968513.\u003c/li\u003e\n\u003cli\u003eNabati, Ramin, and Hairong Qi. \u0026ldquo;CenterFusion: Center-Based Radar and Camera Fusion for 3D Object Detection.\u0026rdquo; 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, https://doi.org/10.1109/wacv48630.2021.00157.\u003c/li\u003e\n\u003cli\u003eChen, Xuanyao, et al. FUTR3D: A Unified Sensor Fusion Framework for 3D Detection.\u003c/li\u003e\n\u003cli\u003eBai, Xuyang, et al. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers.\u003c/li\u003e\n\u003cli\u003eVora, Sourabh, et al. \u0026ldquo;PointPainting: Sequential Fusion for 3D Object Detection.\u0026rdquo; 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, https://doi.org/10.1109/cvpr42600.2020.00466.\u003c/li\u003e\n\u003cli\u003eWang, Chunwei, et al. \u0026ldquo;PointAugmenting: Cross-Modal Augmentation for 3D Object Detection.\u0026rdquo; 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, https://doi.org/10.1109/cvpr46437.2021.01162.\u003c/li\u003e\n\u003cli\u003eYin, Tianwei, et al. \u0026ldquo;Multimodal Virtual Point 3D Detection.\u0026rdquo; arXiv, Dec. 2021.\u003c/li\u003e\n\u003cli\u003eXu, Shaoqing, et al. \u0026ldquo;FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection.\u0026rdquo; 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, https://doi.org/10.1109/itsc48978.2021.9564951.\u003c/li\u003e\n\u003cli\u003eChen, Zehui, et al. 
\u0026ldquo;AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection.\u0026rdquo; Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022, https://doi.org/10.24963/ijcai.2022/116.\u003c/li\u003e\n\u003cli\u003eChen, Yukang, et al. Focal Sparse Convolutional Networks for 3D Object Detection. Apr. 2022.\u003c/li\u003e\n\u003cli\u003eLiang, Ming, et al. \u0026ldquo;Deep Continuous Fusion for Multi-Sensor 3D Object Detection.\u0026rdquo; Computer Vision \u0026ndash; ECCV 2018, Lecture Notes in Computer Science, 2018, pp. 663\u0026ndash;78, https://doi.org/10.1007/978-3-030-01270-0_39.\u003c/li\u003e\n\u003cli\u003eLi, Yingwei, et al. DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection.\u003c/li\u003e\n\u003cli\u003eHaotian, Haotian, et al. EA-BEV: Edge-Aware Bird\u0026rsquo;s-Eye-View Projector for 3D Object Detection. Mar. 2023.\u003c/li\u003e\n\u003cli\u003eGe, Chongjian, et al. MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation. Apr. 2023.\u003c/li\u003e\n\u003cli\u003eXie, Yichen, et al. SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection.\u003c/li\u003e\n\u003cli\u003eXia, Zhuofan, et al. Vision Transformer with Deformable Attention.\u003c/li\u003e\n\u003cli\u003eLi, Zhiqi, et al. \u0026ldquo;BEVFormer: Learning Bird\u0026rsquo;s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers.\u0026rdquo; Computer Vision \u0026ndash; ECCV 2022, Lecture Notes in Computer Science, 2022, pp. 1\u0026ndash;18, https://doi.org/10.1007/978-3-031-20077-9_1.\u003c/li\u003e\n\u003cli\u003eZou, Jiayu, et al. DiffBEV: Conditional Diffusion Model for Bird\u0026rsquo;s Eye View Perception. Mar. 2023.\u003c/li\u003e\n\u003cli\u003eVaswani, Ashish, et al. 
\u0026ldquo;Attention Is All You Need.\u0026rdquo; Neural Information Processing Systems,Neural Information Processing Systems, June 2017.\u003c/li\u003e\n\u003cli\u003eHuang, Junjie, et al. BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View.\u003c/li\u003e\n\u003cli\u003eLi, Yinhao, et al. BEVDepth: Acquisition of Reliable Depth for Multi-View 3D Object Detection.\u003c/li\u003e\n\u003cli\u003eLin, Liting, et al. \u0026ldquo;SwinTrack: A Simple and Strong Baseline for Transformer Tracking.\u0026rdquo; arXiv: Computer Vision and Pattern Recognition,arXiv: Computer Vision and Pattern Recognition, Dec. 2021.\u003c/li\u003e\n\u003cli\u003eZhou, Yin, and Oncel Tuzel. \u0026ldquo;VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection.\u0026rdquo; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, https://doi.org/10.1109/cvpr.2018.00472.\u003c/li\u003e\n\u003cli\u003eHe, Kaiming, et al. \u0026ldquo;Deep Residual Learning for Image Recognition.\u0026rdquo; 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, https://doi.org/10.1109/cvpr.2016.90.\u003c/li\u003e\n\u003cli\u003eYin, Tianwei, et al. \u0026ldquo;Center-Based 3D Object Detection and Tracking.\u0026rdquo; 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, https://doi.org/10.1109/cvpr46437.2021.01161.\u003c/li\u003e\n\u003cli\u003eLi, Yanwei, et al. Unifying Voxel-Based Representation with Transformer for 3D Object Detection. June 2022.\u003c/li\u003e\n\u003cli\u003ePan, Liang, et al. Multi-View Partial (MVP) Point Cloud Challenge 2021 on Completion and Registration: Methods and Results.\u003c/li\u003e\n\u003cli\u003eBai, Xuyang, et al. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers.\u003c/li\u003e\n\u003cli\u003eYang, Zeyu, et al. DeepInteraction: 3D Object Detection via Modality Interaction. Aug. 2022.\u003c/li\u003e\n\u003cli\u003eHuang, Junjie, et al. 
BEVDet4D: Exploit Temporal Cues in Multi-Camera 3D Object Detection.\u003c/li\u003e\n\u003cli\u003eLang, Alex H., et al. \u0026ldquo;PointPillars: Fast Encoders for Object Detection from Point Clouds.\u0026rdquo; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, https://doi.org/10.1109/cvpr.2019.01298.\u003c/li\u003e\n\u003cli\u003eYan, Yan, et al. \u0026ldquo;SECOND: Sparsely Embedded Convolutional Detection.\u0026rdquo; Sensors, Oct. 2018, p. 3337, https://doi.org/10.3390/s18103337.\u003c/li\u003e\n\u003cli\u003eYan, Junjie, et al. Cross Modal Transformer: Towards Fast and Robust 3D Object Detection. Jan. 2023.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"