A Generalizable Vision-Based Framework for Vehicle Trajectory Estimation and Conflict Analysis at Intersections

doi:10.21203/rs.3.rs-7916322/v1

2025 · doi:10.21203/rs.3.rs-7916322/v1

preprint OA: closed

Research Article

Swaranjit Roy, Ahmed Abdelhadi, Ph.D., Sherif M. Gaweesh, Ph.D., P.E., RSP

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7916322/v1. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

The lack of scalable and cost-effective methods for extracting actionable vehicle trajectories from existing traffic CCTV infrastructure limits proactive traffic safety analysis. Traditional trajectory estimation approaches often rely on LiDAR, radar, or calibrated camera systems, which are costly and lack scalability. This study introduces a novel, plug-and-play framework for vision-based vehicle trajectory estimation using monocular CCTV footage, eliminating the need for camera calibration.
The proposed system combines homography-based Bird's-Eye View (BEV) transformation with a You Only Look Once (YOLO) v8 Oriented Bounding Box (OBB) detector, trained on a custom dataset, to estimate vehicle trajectories from traffic footage. The framework introduces a novel custom-defined “space” bounding box that accurately captures the physical footprint of moving objects. It leverages visual cues, such as tire shadows and distortion patterns, effectively addressing challenges related to occlusion and distortions. The YOLOv8-OBB model, trained on the compiled dataset, achieves high performance, with a Mean Average Precision (mAP) @50–95 of 0.92 and precision and recall exceeding 0.95. Trajectory refinement was achieved through temporal sub-sampling, moving average smoothing, and slope-based orientation correction, resulting in stable and physically realistic paths even during turns and visual occlusions. Calculated speed and acceleration profiles from refined trajectories align with real-world driving behavior, further validating the system’s accuracy. The pipeline was successfully tested on an unseen intersection, demonstrating its generalizability across varied traffic geometries and perspectives. This work presents a scalable, calibration-free solution for trajectory-based traffic monitoring, with potential applications in conflict detection, traffic modeling, and intersection safety assessments using widely available surveillance infrastructure.

Subject areas and keywords: Civil Engineering; Artificial Intelligence and Machine Learning; Trajectory Estimation; Traffic Safety Analysis; YOLOv8-Oriented Bounding Box (OBB); Vehicle Detection and Tracking; Intelligent Transportation Systems (ITS)

INTRODUCTION

Intersections are among the most crash-prone areas due to the complexity of maneuvers like turning, merging, and crossing, combined with time-sensitive decisions (1, 2).
In the U.S., intersections account for nearly 40% of all crashes, 50% of serious injuries, and around 20% of fatalities (3, 4). Given their high risk, intersection safety has traditionally been evaluated using historical crash data. However, this reactive approach is limited by underreporting, lack of near-miss data, small sample sizes, and delayed feedback (5, 6). To overcome these issues, recent research emphasizes proactive conflict analysis over crash-based evaluation (2, 4, 7). Conflict analysis focuses on near-miss events, that is, incidents that could have resulted in crashes had evasive actions not been taken (2). These events occur more frequently than actual collisions and offer critical insight into unsafe interactions. Accurate detection of such events requires precise, continuous trajectory data (5, 8).

Common sources for trajectory estimation include UAV footage, remote sensing imagery, GPS, LiDAR point clouds, onboard vehicle sensors, and CCTV video (7, 9–11). Traffic surveillance cameras are now widely deployed at intersections to monitor traffic, and advances in computer vision have made CCTV footage a scalable resource for safety management (7, 11–13). Compared to other data sources, CCTV offers practical benefits: continuous operation, low cost, widespread coverage, and no need for in-vehicle instrumentation. As a result, recent research increasingly leverages CCTV streams for proactive safety assessments (14–16).

Various computer vision models have been applied to detect and track road users in CCTV footage (17–22), with monocular 3D-based methods gaining popularity (21, 22). However, monocular 3D object detection from CCTV footage is challenging due to the absence of depth information (18, 22).
Most approaches depend on prior knowledge of object size and camera calibration to map 2D images to 3D space using annotated datasets (18, 19); however, such requirements are often unmet in traffic footage, where calibration data and 3D annotations are typically unavailable (18).

To address the limitations of monocular CCTV footage, Homography Transformation (HT) is often used to generate 2D Bird’s-Eye View (BEV) projections, enabling more reliable spatial analysis (17–19). BEV mimics human spatial reasoning by emphasizing relative positions and movement patterns over exact depth, offering a map-like perspective that simplifies interpretation. It also reduces occlusion, prevents bounding box overlap, and allows accurate spatial measurement, all of which benefit trajectory analysis and conflict detection (7, 22). However, BEV transformations from traffic CCTV footage face challenges. HT-generated BEV images often contain distortions and occlusions, unlike true top-down views from aerial imagery (17, 18, 22). Inverse Perspective Mapping (IPM), a common HT method, stretches distant objects, distorting vehicle shapes and complicating detection (19). While some studies have proposed enhancements for object detection in warped BEV images (23, 24), existing methods lack generalizability and are not easily transferable to other intersections, limiting their scalability.

This study presents a novel and generalizable framework for BEV-based trajectory estimation from traffic CCTV footage, addressing the limitations of existing calibration-dependent and intersection-specific methods. The proposed approach introduces a custom-defined "space" estimation box that infers spatial relationships using visual cues such as vehicle headlights, tire positions, and shadows, thereby eliminating the need for intrinsic or extrinsic camera calibration.
It integrates a customized YOLOv8-OBB detection model that operates directly on BEV-transformed images and maintains robust performance across varied intersection geometries. By linking detections over time, the framework reconstructs accurate vehicle trajectories and enables precise estimation of speed and acceleration in BEV space. This approach enhances scalability, reduces deployment complexity, and supports large-scale, proactive traffic safety analysis.

LITERATURE REVIEW

Traffic video footage has been widely explored for conflict analysis. One study used a clustering-based approach to identify and adapt conflicting trajectory pairs (12). Another developed an automated computer vision tool that significantly reduced rear-end, merging, and total conflicts (11). A separate study combined UAV data with a convolutional neural network to improve road user detection, reporting substantial gains in accuracy (7). These efforts have advanced video analytics in traffic conflict analysis and continue to guide research in this area. However, accurate trajectory estimation from standard CCTV footage remains an underexplored challenge, and addressing it could improve the reliability and scalability of conflict detection methods (25).

Monocular 3D detection is widely studied for CCTV footage, but the absence of depth-sensing hardware or stereo vision remains a key limitation (19). Depth must be inferred indirectly using visual cues such as object scale, perspective distortion, motion patterns, and scene geometry (21, 26, 27). Some methods enforce geometric consistency between 2D and 3D by aligning box dimensions and orientations (28, 29), while others use monocular depth estimation networks to predict pixel-wise depth maps from single images (30, 31). These approaches, however, require 3D ground truth for supervised learning.
Without it, models struggle to generalize, especially in complex urban scenes, and are further hindered by occlusion, lighting variation, and scene complexity (32).

CenterLoc3D is a 3D object localization framework that bypasses traditional 2D detection by directly predicting vehicle centroids, bounding box vertices, and 3D dimensions using a multi-scale fusion module and constrained loss function (21). While accurate in real time, it depends on camera calibration, limiting its use in uncalibrated traffic scenes. Another approach combines 3D box depth-ordering with an LSTM motion model for robust tracking but also requires stereo or calibrated sensors (33). A separate study introduced a large-scale dataset to train Cube R-CNN, though it faced challenges with detection drift and orientation errors (29).

UAV-based data collection offers a promising alternative to monocular 3D tracking by providing top-down views that eliminate perspective distortion and ensure consistent scale (7, 34, 35). However, UAVs face limitations such as short battery life and FAA restrictions on flights over populated areas, making them impractical for continuous urban monitoring (34). These methods also rely on stable, high-altitude views that differ significantly from ground-mounted CCTV footage, which typically captures oblique angles. This geometric mismatch limits the direct applicability of UAV-trained models to CCTV data, which is more affected by occlusion, scale variation, and distortion (22).

Several studies have attempted to convert CCTV footage into UAV-like top-down BEV using homography transformation, typically via vanishing point estimation or landmark-based methods. Vanishing-point methods assume straight, unobstructed roads and vehicle-road alignment, making them unreliable in intersections or curved scenes (18, 31).
Landmark-based methods align image features (e.g., crosswalks, stop lines) with reference imagery, offering better robustness in complex settings (22, 31, 36). One study proposed tailed rotated box regression, linking vehicle base centers to extended BEV shapes using a "tail" vector, but it requires accurate 3D boxes, class-specific box dimensions, and customized YOLO outputs (18). Other studies combined vanishing points with homography but struggled in complex traffic environments (10, 24, 30). Another applied IMU-based motion correction with homography for front-view BEV projection, though it depends on sensor data and is limited to highway-like scenarios (37). A further study proposed 3D-Net for efficient 3D object recognition, Multi-Class Object Tracking (MCOT) under occlusion, and Sliding Grid Inverse Perspective Mapping for automatic calibration to reduce distortion (31). However, its tracker depends on class-based identity assignment, limiting generalizability.

YOLO and CenterTrack were evaluated for road user detection and trajectory estimation from CCTV footage (14). CenterTrack, effective for low-elevation cameras (5–15 ft), jointly detects and tracks via inter-frame displacement prediction. YOLOv7 was preferred for higher viewpoints due to stronger detection. Speed and orientation were estimated using methods from (38). However, CenterTrack’s performance degrades at elevated views, and YOLOv7 lacks built-in 3D detection or transformation, leading to inaccurate spatial estimates. BEV box orientations often misalign with vehicle paths, reducing reliability at intersections. Moreover, most existing studies rely on base YOLO variants, which do not support oriented bounding boxes, an important limitation for capturing complex vehicle maneuvers in intersection scenarios (39–41).
Despite numerous studies on BEV transformation and trajectory estimation from CCTV footage, existing work often falls short when it comes to direct integration with widely used open-source detection models like YOLO. This lack of compatibility limits practical applicability and reduces scalability, hindering broader adoption in real-world traffic analysis tasks.

DATA COLLECTION AND STUDY SITE

Traffic CCTV footage was collected from a public livestream of the Four Corners intersection in downtown Coldwater, Michigan, a signalized, high-volume site frequently used by large trucks and conveyors, offering diverse vehicle types for analysis (42). Its public availability supports transparency and reproducibility. Recordings were captured using OBS Studio from 6:00 AM to 6:00 PM on alternating days over one week at 30 fps, yielding 36 hours of footage. Selected frames were converted to BEV using Homography Transformation (HT) and balanced across vehicle classes to ensure robust model training. Annotation was performed using Roboflow for its intuitive interface, YOLO format support, and built-in tools for augmentation, versioning, and efficient dataset export (43).

Dataset for Multi-Class Road User Detection (MCRUD) from BEV Transformed Imagery

To address the lack of spatially grounded annotations in existing datasets, a custom BEV dataset was developed using CCTV video frames. Most public datasets are tailored to natural perspective views and do not capture the physical footprint of road users in top-down imagery, limiting their applicability for BEV-based detection and trajectory analysis. Additionally, prior studies often rely solely on class-based annotations (e.g., cars, pedestrians), without representing the actual space occupied, which is critical for trajectory reconstruction and interaction assessment. The developed dataset includes six classes: cars, trucks, bicycles, pedestrians, objects, and space.
For each dynamic road user (car, truck, bicycle, pedestrian), two bounding boxes were annotated: (1) the object class and (2) the physical space it occupies. Static infrastructure elements (e.g., utility poles) were labeled as "objects" and paired with space boxes to ensure consistency in spatial representation. These "space" annotations are essential for accurate speed estimation, conflict analysis, and surrogate safety metrics. The final dataset contains 564 annotated images, within the recommended 500–1000 image range for object detection (44). It supports BEV-specific model development and enhances generalizability by reflecting a transferable spatial structure suitable for varied intersections and camera angles.

METHODOLOGY

The following subsections outline key components of the proposed methodology. HT and Camera Calibration describes how raw CCTV footage is transformed into a top-down BEV view using point-based mapping with aerial imagery. It also details a manual process for estimating pixel-to-meter ratios, enabling real-world measurements without intrinsic or extrinsic camera parameters. Bounding Boxes for Defining Vehicle Space introduces a novel feature-guided method that estimates vehicle footprints in BEV images using cues such as tire positions, headlights, distortion slopes, and shadows, allowing adaptation to different vehicle orientations and movements beyond fixed-dimension bounding boxes. Object Detection details a custom-trained YOLOv8-OBB model designed to detect oriented vehicles directly in BEV space. Dynamic Object Tracking explains the use of ByteTrack to associate detections across frames and maintain consistent vehicle identities. Lastly, Trajectory Estimation and Refinement covers the extraction, smoothing, and refinement of vehicle paths to calculate motion parameters crucial for conflict and safety analysis.
Homography Transformation and Camera Calibration

To generate the BEV transformation, a homography matrix was computed by manually selecting matching points between the original camera view and a reference top-down image (Fig. 1(a) and (b)). Using these point pairs, OpenCV’s cv2.findHomography() function estimated the projective transformation matrix H, which maps coordinates from the camera view to BEV space via x′ = Hx, where x and x′ are corresponding points (in homogeneous coordinates) in the source and destination images, respectively (45). This matrix was applied to each video frame using cv2.warpPerspective() for consistent projection into BEV (Fig. 1(d)); it was saved as a .npy file so that the same transformation was applied across the sequence. Figure 1(c) shows the aligned overlay of the CCTV and satellite images, confirming calibration accuracy.

A manual point-based calibration method was developed to estimate the pixel-to-meter ratio for real-world spatial analysis. Using a static CCTV frame, known-distance point pairs were annotated and their pixel distances computed. A grid search (starting at 1.0, incremented by 0.1) identified the ratio minimizing the Mean Absolute Error (MAE) between the distances estimated from pixel lengths and the real-world distances. This approach enables accurate calibration without requiring intrinsic or extrinsic camera parameters, making it well suited to uncalibrated traffic surveillance footage.

Bounding Boxes to Define Vehicle Space

To enhance spatial estimation accuracy, vehicle movements were categorized by directional orientation, specifically north–south and east–west, because BEV appearance varies with motion direction and camera angle. This direction-specific approach is especially useful at intersections, where maneuvers and perspective distortions vary. Tailoring estimation to each direction ensures that bounding boxes more accurately represent the space occupied by vehicles.
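Before the box construction is detailed, the homography and pixel-to-meter calibration steps of the preceding subsection can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' code. Only cv2.findHomography(), cv2.warpPerspective(), the 1.0 starting value, and the 0.1 increment come from the text. The function names, the `stop` bound of the grid search, and the convention that a pixel length is converted to meters by dividing by the ratio are assumptions made for illustration.

```python
import math


def warp_to_bev(frame, camera_pts, bev_pts, bev_size):
    """Estimate H from matched points (at least 4 pairs) and warp one frame.

    OpenCV and NumPy are imported lazily so the pure-Python helpers below
    run without them. bev_size is (width, height) in pixels.
    """
    import cv2
    import numpy as np

    H, _mask = cv2.findHomography(np.float32(camera_pts), np.float32(bev_pts))
    return cv2.warpPerspective(frame, H, bev_size), H


def apply_homography(H, point):
    """Map one (x, y) point with x' = Hx in homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # divide by the scale term to return to pixels


def best_px_per_m(reference_lines, start=1.0, step=0.1, stop=50.0):
    """Grid-search the px/m ratio that minimizes MAE over reference lines.

    reference_lines: iterable of ((x1, y1), (x2, y2), true_length_m).
    `stop` is an assumed upper bound; the source does not state one.
    """
    lines = list(reference_lines)
    pixel_lengths = [math.dist(p, q) for p, q, _ in lines]
    true_lengths = [d for _, _, d in lines]

    best_ratio, best_mae = None, float("inf")
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        ratio = round(start + i * step, 10)  # avoid float drift in the grid
        errors = [abs(px / ratio - d) for px, d in zip(pixel_lengths, true_lengths)]
        mae = sum(errors) / len(errors)
        if mae < best_mae:
            best_ratio, best_mae = ratio, mae
    return best_ratio, best_mae
```

In use, `warp_to_bev` would be called once to obtain H (which could then be stored with `numpy.save`, matching the .npy file mentioned above), and `best_px_per_m` would be fed the manually clicked reference lines. Indexing the grid by an integer step rather than accumulating `ratio += step` is a deliberate choice to keep the candidate ratios free of floating-point drift.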
Trajectories were then estimated by tracking the center point of this space box across frames, offering a more accurate reflection of actual vehicle presence than visible footprint alone. Figure 2 illustrates the step-by-step process for defining the spatial bounding box. The yellow line marks the reference edge, green lines guide distortion alignment, and blue lines indicate estimated vehicle width.

For vehicles moving north or south in a straight path, as shown in Fig. 2(a) and (b), the process begins by drawing a horizontal line along the shadow edge and a vertical line between the two visible tires. Their intersection forms Point 1, the starting corner. Sloped lines are then drawn through each headlight, aligned with the vehicle’s distortion, forming Points 2 and 3 where they intersect the initial lines. A vertical line from Point 2 and a horizontal line from Point 3 intersect to form Point 4. Connecting Points 1, 2, 4, and 3 yields the final spatial bounding box, accurately capturing the vehicle’s distorted footprint in the BEV view.

For vehicles moving east or west in a straight path, as shown in Fig. 2(c) and (d), the spatial bounding box was constructed using reference lines and average vehicle dimensions. A horizontal line was drawn along the tire alignment and a vertical line along the shadow edge, intersecting at Point 1, the starting corner. An angled line following the vehicle’s distortion and passing through the visible headlight intersected the horizontal line to form Point 2. Using known or estimated vehicle width, perpendicular lines were extended from the axis defined by Points 1 and 2 to locate Points 3 and 4. Together, these four points defined the final spatial bounding box.

For turning movements of vehicles going north or south (Fig. 2(e) and (f)), the spatial bounding box was constructed using tire and shadow alignment. A line was drawn between the visible tires, and a perpendicular line along the shadow edge.
Their intersection marked Point 1. Two diagonal lines were then traced along the outer edges of the vehicle’s distortion, intersecting the initial lines to form Points 2 and 3. Finally, a line parallel to the tire line through Point 2 and a line parallel to the shadow line through Point 3 intersected at Point 4. Connecting all four points completed the bounding box.

For turning movements of vehicles going east or west (Fig. 2(g) and (h)), the spatial bounding box was constructed using two perpendicular reference lines: one along the tire alignment and the other along the shadow edge. Their intersection marked Point 1. A diagonal line following the vehicle’s distortion slope and passing through the headlight intersected a reference line to form Point 2. The vehicle’s width was then estimated using top-down reference imagery and used to project Points 3 and 4 perpendicular to the line between Points 1 and 2, completing the bounding box.

Development and Optimization of the Custom Object Detection Algorithm

Accurate detection in BEV-transformed frames requires custom training, as most pre-trained models are optimized for natural perspective views. This study trained a detection model using feature-guided spatial bounding boxes tailored to BEV geometry, enabling more precise representation of vehicle footprints, an advancement overlooked in prior studies. The YOLOv8 architecture, released in January 2023 and extended with native oriented bounding box (OBB) support in January 2024, was selected because OBB output is essential for capturing object orientation during turning maneuvers (46). YOLOv8-OBB was preferred over YOLOv11-OBB due to its proven stability, thorough benchmarking, strong documentation, and wide adoption at the time (47). In 2023 alone, over 19 million YOLOv8 models were trained, with 15 billion object detections recorded and 4 million downloads in December (47).
The use of this YOLOv8-OBB model, specifically designed to handle BEV and aerial imagery, represents a key contribution of this paper, distinguishing it from existing works that did not leverage such an architecture. In this study, the YOLOv8-OBB model was custom-trained on the BEV Road User Detection Dataset using Google Colab’s T4 GPU. Training ran for up to 500 epochs, with early stopping applied after 50 epochs of no improvement, following best practices in recent YOLO research (48–50). This patience setting helps prevent overfitting while maintaining computational efficiency (49).

Dynamic Object Tracking and Trajectory Estimation

This study uses ByteTrack, a state-of-the-art multi-object tracking algorithm known for high accuracy and efficiency (51). By associating both high- and low-confidence detections, ByteTrack improves tracking continuity under occlusions and missed frames. Unlike SORT, which discards low-confidence boxes and suffers from identity switches, ByteTrack offers more stable tracking (52). Compared to DeepSORT, it avoids computationally expensive feature extraction while maintaining competitive accuracy. Its simplicity, compatibility with YOLO detectors, and low overhead make it suitable for real-time applications.

To enhance reliability, trajectory points were sampled every 5th frame and smoothed using a moving average filter (window size = 3). This configuration, chosen after testing intervals of 2–10 frames and windows of 2–6, balanced jitter reduction with preservation of the motion detail that is crucial for accurate speed and acceleration estimation.

Camera Calibration and Validation

The calibration technique in this study estimates the optimal pixel-to-meter (px/m) ratio by manually selecting known-distance reference points in a BEV-transformed image. Users define multiple reference lines by clicking two points and inputting their real-world distance (in meters).
Pixel distances are then computed, and a range of px/m ratios is evaluated by calculating the Mean Absolute Error (MAE) between estimated and actual distances. The optimal ratio is identified as the value that minimizes overall MAE. In this case, an iterative search (starting at 1.0 px/m with 0.1 increments) yielded an optimal value of 19.2 px/m. Reference points, such as lane edges, were selected for their visibility and consistency across satellite and BEV images, minimizing alignment errors. Calibration accuracy was further assessed by computing percentage errors, as reported in Table 1.

To validate the accuracy of the calibrated BEV-transformed images, a quantitative analysis was performed using ground-truth distances from a satellite image. Table 1 lists the reference lines with their real-world lengths and corresponding measurements in the BEV image. The percentage error for each line was calculated using the formula:

% Error = \(\frac{\text{Dimension in BEV Perspective} - \text{Dimension in Satellite Image}}{\text{Dimension in Satellite Image}} \times 100\)

The results show an average percentage error of 0.84%, with a maximum deviation of only 1.67%, well within the commonly accepted 2% threshold for homography-based spatial calibration (53). This confirms that the BEV transformation maintains geometric fidelity, supporting reliable estimation of speed, acceleration, and vehicle spacing.

Table 1 Percentage Error Between Satellite and BEV Image Measurements

Line    Dimension, Satellite (True)    Dimension, BEV (Observed)    % Error
AB      38.5                           38.76                        0.68
CD      38.7                           38.91                        0.54
EF      12.2                           12.24                        0.33
GH      10.2                           10.37                        1.67
GI      33.8                           34.13                        0.98

To further validate the transformation, vehicle dimensions estimated from BEV frames were compared to real-world specifications.
Typical SUVs measure 4.8 ± 0.15 m (length) and 1.9 ± 0.1 m (width), while sedans are around 4.4 ± 0.2 m by 1.75 ± 0.08 m. A two-sample t-test showed no significant difference between BEV-estimated and actual dimensions (t(8, 8) = 0.00, p < 0.01), confirming that the spatial bounding boxes reliably capture real-world vehicle sizes.

RESULTS

Model Training, Validation and Detection Performance

Figure 3(a) presents training and validation plots for the YOLOv8-Oriented Bounding Box (OBB) model over ~230 epochs. The top row shows training metrics, while the bottom row displays validation performance, offering a complete view of model learning in BEV-transformed images. The model showed consistent convergence across all key loss components: box loss (bounding box accuracy), classification loss (label prediction), and Distribution Focal Loss (DFL), which improves boundary precision through continuous label distributions. Box loss decreased from roughly 2.5–3.0 to 0.3, and DFL from 2.9 to 1.5, then stabilized, indicating healthy learning progression. Validation losses closely followed training trends, confirming strong generalization and no signs of overfitting.

Model effectiveness was further validated using standard evaluation metrics. Precision improved from 50% to over 95%, and recall from 25% to above 97%, indicating reduced false positives and significantly improved detection coverage. The model achieved near-perfect detection accuracy, with mAP@50 reaching 99% and mAP@50–95 at 92%, demonstrating strong performance under both lenient and strict IoU thresholds.

Figure 3(b) and (c) show the confusion matrices for multi-class classification on the BEV-transformed dataset. The raw matrix reflects absolute counts, while the normalized matrix shows per-class accuracy. The model correctly classified 218 cars, 239 space instances, and 67 objects. Misclassifications were minimal, such as 11 space instances mislabeled as background and 13 as trucks.
A few rare errors occurred between pedestrians and trucks. The normalized matrix shows near-perfect accuracy (1.00) for most classes, including bicycle, car, object, and pedestrian. The space class, critical for trajectory estimation, was classified correctly 98% of the time. Remaining confusion, such as trucks misidentified as space (48%) and 2% of space as background, likely stems from overlapping footprints in BEV views.

Figure 4 illustrates temporal consistency and variability in detection confidence across frames in a multi-class traffic scene. The model accurately detects all annotated classes: cars, trucks, bicycles, pedestrians, static objects, and spatial extents of moving vehicles. One clear trend observed in Fig. 4(b) is that confidence scores remain low when vehicles are partially visible (e.g., 0.43 and 0.47) and rise above 0.85 once the vehicle fully enters the frame. This suggests the model relies on complete visual cues like headlights, tires, and contours for accurate space detection. This dependency highlights the importance of full vehicle visibility for reliable trajectory initialization, conflict analysis, and real-time monitoring.

Figure 4(b) and (d) reveal that pedestrian and bicycle classes consistently receive lower confidence scores than larger vehicles. This is due to their smaller size in BEV frames (resulting in fewer pixels) and their underrepresentation in the training dataset. These findings highlight the model’s strength in detecting fully visible, larger road users, while also identifying the need for improved early detection and better class balance during training. Though occasional misclassifications occur between visually similar classes, they are rare and do not significantly impact tasks like trajectory estimation or safety analysis.
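To make concrete how detections feed the trajectory estimates mentioned above, the refinement settings reported in the methodology (sampling every 5th frame, a moving average with window size 3) and the slope-based heading, speed, and acceleration calculations can be sketched as follows. This is an illustrative sketch rather than the authors' implementation. The function names are hypothetical, the centered window that shrinks at the trajectory ends is an assumption (the source does not say how endpoints are handled), and the defaults of 30 fps and 19.2 px/m are taken from the data collection and calibration sections. Outputs are in m/s and m/s², whereas the paper reports mph (1 m/s ≈ 2.237 mph).

```python
import math


def subsample(points, every=5):
    """Keep every `every`-th trajectory point (the source uses every 5th frame)."""
    return points[::every]


def moving_average(points, window=3):
    """Centered moving average over (x, y) points; the window shrinks at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(points)):
        chunk = points[max(0, i - half): i + half + 1]
        smoothed.append((
            sum(p[0] for p in chunk) / len(chunk),
            sum(p[1] for p in chunk) / len(chunk),
        ))
    return smoothed


def headings_deg(points):
    """Heading of each segment from its slope, in image coordinates (y grows downward)."""
    return [
        math.degrees(math.atan2(y1 - y0, x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


def speeds_mps(points, px_per_m=19.2, fps=30.0, every=5):
    """Segment speeds in m/s: pixel displacement -> meters -> divide by elapsed time."""
    dt = every / fps  # seconds between consecutive sampled points
    return [math.dist(p, q) / px_per_m / dt for p, q in zip(points, points[1:])]


def accelerations_mps2(speeds, fps=30.0, every=5):
    """Finite-difference acceleration in m/s^2 between consecutive speed samples."""
    dt = every / fps
    return [(v1 - v0) / dt for v0, v1 in zip(speeds, speeds[1:])]
```

A typical call chain would be `moving_average(subsample(centers))` on the tracked space-box centers, followed by `headings_deg`, `speeds_mps`, and `accelerations_mps2`. Because the shrinking window pulls the first and last smoothed points inward, the first and last speed values are biased low and are best discarded or treated with care.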
Trajectory Analysis for Speed and Acceleration Estimation Trajectory estimation is essential for deriving motion characteristics like speed and acceleration, which underpin conflict analysis and behavioral assessment in traffic scenes. However, raw trajectories often exhibit jitter, noise, and orientation errors especially during turning due to occlusions, oblique views, and inconsistent bounding box centers. To address these issues, a three-step post-processing framework was implemented. First, trajectory points were sampled every 5th frame to reduce noise and suppress transient errors. Second, a moving average filter (window size = 3) was applied to smooth the path while preserving directional changes. Third, orientation correction was performed by calculating the trajectory slope and aligning the bounding box angle with the true movement direction. This process significantly improved the temporal and spatial consistency of trajectories, enabling accurate speed and acceleration estimation. Frame-by-frame analysis of speed and acceleration, commonly used in conflict studies, serves as a practical validation method where ground-truth dynamics are unavailable, helping assess the realism of derived motion profiles. Figure 5 presents speed and acceleration profiles of three vehicles engaged in distinct maneuvers: (a) a straight-moving vehicle, (b) a left-turning vehicle, and (c) a right-turning vehicle. The data, sampled every 10th frame, reflects realistic driving behavior and serves as a qualitative validation of the computed motion parameters. The straight-moving vehicle (Fig. 5 (a) ) shows a sharp speed increase exceeding 40 mph, with a peak acceleration of nearly 8 mph², followed by a drop to zero indicating the vehicle reached cruising speed. This pattern aligns with typical behavior after crossing an intersection or merging into faster traffic. In contrast, both the left-turning (Fig. 5 (b) ) and right-turning (Fig. 
5(c)) vehicles reach lower peak speeds of ~23 mph and ~17 mph, respectively, consistent with cautious turning behavior. Their acceleration profiles are more moderate (2.5–3.5 mph²) and exhibit smoother fluctuations across frames, reflecting gradual throttle adjustments during curved movements.

Generalizability of Vision-Based Trajectory Estimation Across Diverse Intersections

To evaluate the generalization capability of the proposed detection and tracking framework, we tested the YOLOv8-OBB-based model on a completely disjoint and previously unseen dataset that contained no images from the original MCRUD dataset. This secondary dataset was captured at a different intersection located on Durango Drive in Nevada. The site featured a different scene composition, camera angle, and mounting height, resulting in noticeably different BEV distortions. Testing on this dataset allowed us to assess the framework’s adaptability to varied spatial configurations, a critical but often overlooked aspect in prior studies. Although the performance scores are lower than those obtained on the MCRUD dataset (precision of 95%, recall of 97%, mAP@0.5 of 99%, and mAP@0.5:0.95 of 92%), the model still performs reasonably well on the disjoint dataset, attaining a precision of 86%, recall of 82%, mAP@0.5 of 78%, and mAP@0.5:0.95 of 71%. The drop in accuracy can be attributed to several factors: vehicles in the Durango Drive footage appear significantly smaller because of the higher camera position, and both lighting variation and scene-specific distortion affected detection accuracy. Despite these challenges, the model demonstrated strong generalization performance. It is generally accepted in the object detection literature that an mAP@0.5 above 70% is considered good and that an mAP@0.5:0.95 above 50% indicates robust performance across varying intersection-over-union thresholds (41).
Therefore, the reported results underscore the model's reliability even in unseen and visually different environments.

Figure 6 illustrates detection outputs from the proposed YOLOv8-OBB model on several frames of the disjoint test dataset from the Durango Drive intersection. In Fig. 6(a) through (d), the model demonstrates high confidence in detecting vehicles such as cars and trucks, with confidence scores consistently exceeding 0.90, indicating robust performance even in a visually different environment. Notably, the detection of a pedestrian (visible in frames b and c) highlights the model’s dynamic adaptation. Initially, the pedestrian is either missed (Fig. 6(a)) or assigned a low confidence score of 0.30 (Fig. 6(b)), primarily due to the rarity of pedestrian instances in the training set and the tendency of the model to confuse small objects such as pedestrians with crosswalk markings. However, as the pedestrian continues to move across subsequent frames, the model correctly identifies the object and its confidence increases to 0.81 (Fig. 6(c)). This progression underscores the value of motion cues in distinguishing dynamic objects from static background elements.

DISCUSSION AND CONCLUSION

This study introduced a comprehensive vision-based framework for detecting, tracking, and estimating vehicle trajectories using BEV-transformed traffic camera footage. One of the core contributions lies in developing a YOLOv8-OBB-based detection model that operates directly on BEV images without requiring architectural modification. Unlike prior approaches that relied on complex 2D-to-3D conversions or perspective transformations tied to calibrated views, the proposed method streamlines BEV detection as a plug-and-play solution that is compatible with standard object detection pipelines.
This not only enhances scalability but also opens up BEV-based traffic analysis to a wider range of researchers and practitioners without the need for in-depth geometric modeling expertise. The novel space-oriented bounding box, derived using interpretable vehicle features such as headlights, tires, and geometric distortions, enabled the system to estimate the actual occupied space more precisely than conventional bounding boxes. This spatial accuracy is crucial for realistic trajectory estimation, particularly in BEV scenes where objects exhibit non-standard shapes and orientations due to distortion. Training and validation results confirmed the robustness of the model, which achieved high performance across all major detection metrics on the primary MCRUD dataset: 95% precision, 97% recall, 99% mAP@0.5, and 92% mAP@0.5:0.95. The model showed excellent convergence behavior and strong generalization to validation data, supported by aligned loss curves and high classification accuracy across all classes.

In real-world traffic environments, smooth and physically consistent trajectories are essential for applications such as speed estimation, conflict analysis, and risk-based traffic management. The proposed trajectory post-processing framework, based on temporal sampling, moving average smoothing, and orientation correction, successfully mitigated frame-to-frame jitter and instability during turning maneuvers. The resulting speed and acceleration profiles captured the distinct motion characteristics of straight-moving versus turning vehicles, with values aligning well with expected real-world driving behavior. These dynamics serve as both a validation of the motion pipeline and a foundation for future integration into conflict detection algorithms.

Perhaps most importantly, the framework was evaluated on a completely disjoint intersection dataset with different camera angles, heights, and visual context from the training data.
Although accuracy was lower on this dataset (precision: 86%, recall: 82%, mAP@0.5: 78%, mAP@0.5:0.95: 71%), these values still meet widely accepted thresholds in object detection research. The performance drop was primarily due to smaller vehicle sizes, lighting variation, and higher camera elevation, which affected the fidelity of visual cues in BEV space. Notably, the model’s detection improved dynamically over time for rare and underrepresented classes such as pedestrians, showing increased confidence as motion cues became available. This highlights the model’s ability to adapt based on temporal context, an important trait for real-time monitoring systems.

The proposed framework demonstrates strong potential for real-world implementation, offering a calibration-free and scalable solution for vehicle detection and trajectory estimation using existing CCTV infrastructure. Its ability to generalize across varied camera setups makes it suitable for large-scale deployment by transportation agencies and ITS centers. By accurately capturing vehicle occupancy and motion dynamics in BEV space, the framework enables proactive safety analysis, near-miss detection, and data-driven intersection design. This study advances vision-based traffic analysis by bridging deep learning detection methods with practical transportation applications. While the model performed robustly, future work should focus on expanding the training dataset to include varied intersection geometries, camera perspectives, and lighting conditions. Enhancing the representation of pedestrians and cyclists, along with exploring domain adaptation techniques, will further improve model generalizability and support broader applicability in real-world contexts.
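As an illustration of the trajectory post-processing summarized above (temporal sampling, moving average smoothing, and orientation correction), the following is a minimal Python sketch. It is not the authors' implementation. The function names, the use of NumPy, the choice of BEV box centers as input, and the `dt` and `scale` parameters are assumptions made for the example; only the sampling step of 5 frames and the window size of 3 come from the text.

```python
import numpy as np

def postprocess_trajectory(centers, step=5, window=3):
    """Subsample, smooth, and derive headings for a single track.

    centers: (N, 2) sequence of per-frame box centers in BEV coordinates.
    step:    keep every `step`-th frame (5 in the text).
    window:  moving-average window size (3 in the text).
    """
    pts = np.asarray(centers, dtype=float)[::step]
    if len(pts) < window:
        raise ValueError("track too short for the chosen step and window")
    kernel = np.ones(window) / window
    # 'valid' mode avoids zero-padding artifacts at both ends of the track.
    smooth = np.column_stack(
        [np.convolve(pts[:, k], kernel, mode="valid") for k in range(2)]
    )
    # Heading (radians) from the slope between consecutive smoothed points.
    # In this sketch, the box angle would simply be set to this heading.
    delta = np.diff(smooth, axis=0)
    heading = np.arctan2(delta[:, 1], delta[:, 0])
    return smooth, heading

def speed_and_acceleration(smooth, dt, scale=1.0):
    """Finite-difference speed and acceleration along a smoothed path.

    dt:    seconds between consecutive smoothed points (step / fps).
    scale: real-world distance per BEV unit (hypothetical calibration factor).
    """
    dist = np.linalg.norm(np.diff(smooth, axis=0), axis=1) * scale
    speed = dist / dt
    accel = np.diff(speed) / dt
    return speed, accel

# Hypothetical track: constant motion along x, 60 frames at an assumed 30 fps.
track = [(0.5 * i, 0.0) for i in range(60)]
smooth, heading = postprocess_trajectory(track)
speed, accel = speed_and_acceleration(smooth, dt=5 / 30)
```

For the constant-velocity track in the usage lines, the heading stays at zero and the acceleration is zero throughout, which is a quick sanity check before applying the same functions to real tracks. Converting the resulting speed to mph would require a site-specific scale factor that the sketch leaves as a parameter.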
Declarations

FUNDING
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the University of North Dakota (UND) Early Career Award program (grant no. U31770-2710-1289).

AUTHOR CONTRIBUTIONS
The authors confirm contribution to the paper as follows: study conception and design: Gaweesh and Abdelhadi; data collection: Roy; analysis and interpretation of results: Roy and Gaweesh; draft manuscript preparation: Roy, Gaweesh, and Abdelhadi. All authors reviewed the results and approved the final version of the manuscript.

ACKNOWLEDGMENT
The authors gratefully acknowledge the University of North Dakota (UND) for supporting this research through the Early Career Faculty Award program, which made this work possible.

References

1. Al-Omari MMA, Abdel-Aty M (2023) Evaluation of Driving Behavior and Traffic Safety at a Shifting Movements Intersection. Transp Res Rec 2677(1):1228–1242. https://doi.org/10.1177/03611981221103865
2. Liu Y, Alsaleh R, Sayed T (2024) Modelling Motorized and Non-Motorized Vehicle Conflicts Using Multiagent Inverse Reinforcement Learning Approach. Taylor & Francis 12(1). https://doi.org/10.1080/21680566.2024.2314762
3. Zhang Y, Liu L, Zhu T (2022) Extracting Traffic Conflict at Urban Intersection Using Deep Learning Trajectory Detection. International Conference on Autonomous Unmanned Systems. Springer
4. Reyad P, Sayed T, Essa M, Zheng L (2022) Real-Time Crash-Risk Optimization at Signalized Intersections. Transp Res Rec 2676(12):32–50. https://doi.org/10.1177/03611981211062891
5. Gaweesh SM, Ahmed I, Ahmed MM (2023) Analysis Framework to Assess Crash Severity for Large Trucks on Rural Interstate Roads Utilizing the Latent Class and Random Parameter Model. Transp Res Rec 2677(9):130–150. https://doi.org/10.1177/03611981231158627
6. Gaweesh SM, Bakhshi AK, Ahmed MM (2021) Safety Performance Assessment of Connected Vehicles in Mitigating the Risk of Secondary Crashes: A Driving Simulator Study. Transp Res Rec 2675(12):117–129. https://doi.org/10.1177/03611981211027881
7. Wu Y, Abdel-Aty M, Zheng O, Cai Q, Zhang S (2020) Automated Safety Diagnosis Based on Unmanned Aerial Vehicle Video and Deep Learning Algorithm. Transp Res Rec 2674(8):350–359. https://doi.org/10.1177/0361198120925808
8. Yao R, Zeng W, Chen Y, He Z (2021) A Deep Learning Framework for Modelling Left-Turning Vehicle Behaviour Considering Diagonal-Crossing Motorcycle Conflicts at Mixed-Flow Intersections. Transp Res Part C: Emerg Technol 132. https://doi.org/10.1016/J.TRC.2021.103415
9. Zheng L, Fan S, Ma S, Jiao H (2024) Multi-Type Traffic Conflict Identification at Signalized Intersections Based on LiDAR Point Cloud. Transp Res Rec 2024(10):916–925. https://doi.org/10.1177/03611981241235178
10. Oh J, Min J, Kim M, Cho H (2009) Development of an Automatic Traffic Conflict Detection System Based on Image Tracking Technology. Transp Res Rec 2129:45–54. https://doi.org/10.3141/2129-06
11. Sayed T, Ismail K, Zaki M, Autey J (2012) Feasibility of Computer Vision-Based Safety Evaluations. Transp Res Rec 2280:18–27. https://doi.org/10.3141/2280-03
12. Saunier N, Sayed T (2007) Automated Analysis of Road Safety with Video Data. Transp Res Rec 2019:57–64. https://doi.org/10.3141/2019-08
13. St-Aubin P, Saunier N, Miranda-Moreno L (2015) Large-Scale Automated Proactive Road Safety Analysis Using Video Data. Transp Res Part C: Emerg Technol 58:363–379. https://doi.org/10.1016/j.trc.2015.04.007
14. Mohamed A, Li L, Ahmed MM (2023) Automated Traffic Safety Assessment Tool Utilizing Monocular 3-D Convolutional Neural Network-Based Detection Algorithm at Signalized Intersections
15. Kronprasert N, Sutheerakul C, Satiennam T, Luathep P (2021) Intersection Safety Assessment Using Video-Based Traffic Conflict Analysis: The Case Study of Thailand. Sustain (Switzerland) 13(22). https://doi.org/10.3390/su132212722
16. Mishra A, Chen K, Poddar S, Posadas E, Rangarajan A, Ranka S (2022) Using Video Analytics to Improve Traffic Intersection Safety and Performance. Vehicles 4(4):1288–1313. https://doi.org/10.3390/vehicles4040068
17. Xie E, Yu Z, Zhou D, Philion J, Anandkumar A, Fidler S, Luo P, Alvarez JM. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation
18. Zhu M, Zhang S, Zhong Y, Lu P, Peng H, Lenneman J (2021) Monocular 3D Vehicle Detection Using Uncalibrated Traffic Cameras through Homography
19. Rashed H, Essam M, Mohamed M, Sallab AE, Yogamani S (2021) BEV-MODNet: Monocular Camera Based Bird’s Eye View Moving Object Detection for Autonomous Driving
20. He T, Soatto S (2019) Mono3d++: Monocular 3d Vehicle Detection with Two-Scale 3d Hypotheses and Task Priors. Proceedings of the AAAI Conference on Artificial Intelligence, p. 19
21. Tang X, Wang W, Song H, Zhao C (2023) CenterLoc3D: Monocular 3D Vehicle Localization Network for Roadside Surveillance Cameras. Complex and Intelligent Systems 9(4):4349–4368. https://doi.org/10.1007/s40747-022-00962-9
22. Liu W, Li Q, Yang W, Cai J, Yu Y, Ma Y, He S, Pan J (2022) Monocular BEV Perception of Road Scenes via Front-to-Top View Projection
23. Li J, Chen S, Zhang F, Li E, Yang T, Lu Z (2019) An Adaptive Framework for Multi-Vehicle Ground Speed Estimation in Airborne Videos. Remote Sens 11(10). https://doi.org/10.3390/RS11101241
24. Linlin Zhang Student C, Yu X, Student P, Daud A A. Rashid Mussah Student, and A.-G. Associate Professor. Application of 2D Homography for High Resolution Traffic Data Collection Using CCTV
25. Yang G, Ahmed M, Gaweesh S (2020) Adomah. Connected Vehicle Real-Time Traveler Information Messages for Freeway Speed Harmonization under Adverse Weather Conditions: Trajectory Level Analysis Using Driving Simulator. Accid Anal Prev 146. https://doi.org/10.1016/j.aap.2020.105707
26. Qin Z, Wang J, Lu Y (2019) Monogrnet: A Geometric Reasoning Network for Monocular 3d Object Localization. Proceedings of the AAAI Conference on Artificial Intelligence
27. Jia J, Li Z, Shi Y. MonoUNI: A Unified Vehicle and Infrastructure-Side Monocular 3D Object Detection Network with Sufficient Depth Clues
28. Palffy A, Pool E, Baratam S, Kooij JFP, Gavrila DM. Multi-Class Road User Detection with 3+1D Radar in the View-of-Delft Dataset
29. Carta S, Castrillón-Santana M, Marras M, Mohamed S, Podda AS, Saia R, Sau M, Zimmer W (2024) RoadSense3D: A Framework for Roadside Monocular 3D Object Detection
30. Ibrahim MR (2024) TopView: Vectorising Road Users in a Bird’s Eye View from Uncalibrated Street-Level Imagery with Deep Learning. https://doi.org/10.1007/s00521-025-11152-2
31. Rezaei M, Azarmi M (2023) Mir. 3D-Net: Monocular 3D Object Recognition for Traffic Monitoring. Expert Syst Appl 227. https://doi.org/10.1016/j.eswa.2023.120253
32. Krajewski R, Bock J, Kloeker L, Eckstein L (2018) The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems. IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC, Vol. 2018-November, pp. 2118–2125. https://doi.org/10.1109/ITSC.2018.8569552
33. Hu H-N, Cai Q-Z, Wang D, Lin J, Sun M, Krähenbühl P, Darrell T, Yu F. Joint Monocular 3D Vehicle Detection and Tracking
34. Barmpounakis E, Geroliminis N (2020) On the New Era of Urban Traffic Monitoring with Massive Drone Data: The pNEUMA Large-Scale Field Experiment. Transp Res Part C: Emerg Technol 111:50–71. https://doi.org/10.1016/J.TRC.2019.11.023
35. Khan MA, Ectors W, Bellemans T, Janssens D, Wets G (2017) Unmanned Aerial Vehicle–Based Traffic Analysis: Methodological Framework for Automated Multivehicle Trajectory Extraction. Transp Res Rec 2626:25–33. https://doi.org/10.3141/2626-04
36. Yohannes E, Lin CY, Shih TK, Thaipisutikul T, Enkhbat A (2023) Utaminingrum. An Improved Speed Estimation Using Deep Homography Transformation Regression Network on Monocular Videos. IEEE Access 11:5955–5965. https://doi.org/10.1109/ACCESS.2023.3236512
37. Kim Y, Kum D (2019) Deep Learning Based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image
38. Coifman B, Li L (2017) A Critical Evaluation of the Next Generation Simulation (NGSIM) Vehicle Trajectory Dataset. Transp Res Part B: Methodological 105:362–377. https://doi.org/10.1016/J.TRB.2017.09.018
39. Soma K, Shibu L, Meenakshi N (2024) A Real-Time Vehicle Detection and Speed Estimation Using YOLO V8. International Conference on Advances in Data Engineering and Intelligent Computing Systems, ADICS 2024. https://doi.org/10.1109/ADICS58448.2024.10533551
40. Islam N, Ray SK, Hossain MA, Rashidul Hasan MAFM, Alamin and M. B. Al Zabir Shammo. Vehicle Classification and Detection Using YOLOv8: A Study on Highway Traffic Analysis. International Conference on Recent Progresses in Science, Engineering and Technology, ICRPSET 2024. https://doi.org/10.1109/ICRPSET64863.2024.10955913
41. Maity S, Chakraborty A, Singh PK, Sarkar R (2023) Performance Comparison of Various YOLO Models for Vehicle Detection: An Experimental Study. Lecture Notes in Networks and Systems, Vol. 787 LNNS, pp. 677–684. https://doi.org/10.1007/978-981-99-6550-2_50
42. Mohamed A, Ahmed M (2024) Towards Rapid Safety Assessment of Signalized Intersections: An in-Depth Comparison of Computer Vision Algorithms. Adv Transp Stud 4 (Special issue):101–116. https://doi.org/10.53136/97912218167168
43. Dwyer B, Nelson J, Hansen T. Roboflow (Version 1.0). https://roboflow.com/
44. Levi K, Weiss Y (2004) Learning Object Detection from a Small Number of Examples: The Importance of Good Features. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. https://doi.org/10.1109/CVPR.2004.1315144
45. OpenCV (2025) OpenCV Modules. https://docs.opencv.org/4.x/index.html. Accessed Apr. 22
46. V8.1.0 Release - YOLOv8 Oriented Bounding Boxes (OBB) · Ultralytics · Discussion #7472 · GitHub (2025) https://github.com/orgs/ultralytics/discussions/7472. Accessed Apr. 22
47. Multi-Object Tracking with Ultralytics YOLO - Ultralytics YOLO Docs (2025) https://docs.ultralytics.com/modes/track/. Accessed Apr. 22
48. Ye Z, Zhang H, Gu J, Li X (2023) YOLOv7-3D: A Monocular 3D Traffic Object Detection Method from a Roadside Perspective. Appl Sci (Switzerland) 13(20). https://doi.org/10.3390/app132011402
49. Farhat W, Ben Rhaiem O, Faiedh H, Souani C (2025) YOLO-TSR: A Novel YOLOv8-Based Network for Robust Traffic Sign Recognition. Transp Res Rec. https://doi.org/10.1177/03611981251327213
50. Jiao J, Wang H (2022) Traffic Behavior Recognition from Traffic Videos under Occlusion Condition: A Kalman Filter Approach. Transp Res Rec 2676(7):55–65. https://doi.org/10.1177/03611981221076426
51. An Introduction to BYTETrack: Multi-Object Tracking by Associating Every Detection Box | Datature Blog (2025) https://datature.io/blog/introduction-to-bytetrack-multi-object-tracking-by-associating-every-detection-box. Accessed July 14
52. Kamil DA, Wahyono A, Harjoko (2024) Jo. Vehicle Speed Estimation Using Consecutive Frame Approaches and Deep Image Homography for Image Rectification on Monocular Videos. IEEE Access. https://doi.org/10.1109/ACCESS.2024.3508135
53. Yin F, Makris D, Velastin SA, Ellis T (2015) Calibration and Object Correspondence in Camera Networks with Widely Separated Overlapping Views. IET Comput Vision 9(3):354–367. https://doi.org/10.1049/IET-CVI.2013.0301

Additional Declarations
The authors declare no competing interests.
Figure 1. Homography Transformation Workflow to Generate a Calibrated Intersection BEV
Figure 2. Steps Followed to Define Space of Detected Vehicles for the Multiple Movements.
Figure 3. Performance of the YOLOv8-OBB Model on the BEV Road User Detection Dataset. (a) Convergence on Training and Validation Set, (b) Confusion Matrix Raw, (c) Confusion Matrix Normalized
Figure 4. Visualization of the Detection Performance in Intersection Environment
Figure 5. Speed and acceleration profiles of vehicles performing different maneuvers: (a) straight-moving vehicle, (b) left-turning vehicle, and (c) right-turning vehicle.
Figure 6. Visualization of Detection on Disjoint Test Dataset to Assess Generalization Capability
To overcome these issues, recent research emphasizes proactive conflict analysis over crash-based evaluation (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eConflict analysis focuses on near-miss events, that is, incidents that could have resulted in crashes had evasive actions not been taken (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). These events occur more frequently than actual collisions and offer critical insight into unsafe interactions. Accurate detection of such events requires precise, continuous trajectory data (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). Common sources for trajectory estimation include UAV footage, remote sensing imagery, GPS, LiDAR point clouds, onboard vehicle sensors, and CCTV video (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan additionalcitationids=\"CR10\" citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eTraffic surveillance cameras are now widely deployed at intersections to monitor traffic, and advances in computer vision have made CCTV footage a scalable resource for safety management (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan additionalcitationids=\"CR12\" citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). 
Compared to other data sources, CCTV offers practical benefits: continuous operation, low cost, widespread coverage, and no need for in-vehicle instrumentation. As a result, recent research increasingly leverages CCTV streams for proactive safety assessments (\u003cspan additionalcitationids=\"CR15\" citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eVarious computer vision models have been applied to detect and track road users in CCTV footage (\u003cspan additionalcitationids=\"CR18 CR19 CR20 CR21\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e), with monocular 3D-based methods gaining popularity (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e). However, monocular 3D object detection from CCTV footage is challenging due to the absence of depth information (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e). 
Most approaches depend on prior knowledge of object size and camera calibration to map 2D images to 3D space using annotated datasets (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e); however, such requirements are often unmet in traffic footage, where calibration data and 3D annotations are typically unavailable (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eTo address the limitations of monocular CCTV footage, Homography Transformation (HT) is often used to generate 2D Bird\u0026rsquo;s-Eye View (BEV) projections, enabling more reliable spatial analysis (\u003cspan additionalcitationids=\"CR18\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). BEV mimics human spatial reasoning by emphasizing relative positions and movement patterns over exact depth, offering a map-like perspective that simplifies interpretation. It also reduces occlusion, prevents bounding box overlap, and allows accurate spatial measurement, which benefits trajectory analysis and conflict detection (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e). However, BEV transformations from traffic CCTV footage face challenges. HT-generated BEV images often contain distortions and occlusions, unlike true top-down views from aerial imagery (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e). 
Inverse Perspective Mapping (IPM), a common HT method, stretches distant objects, distorting vehicle shapes and complicating detection (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). While some studies have proposed enhancements for object detection in warped BEV images (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e), existing methods lack generalizability and are not easily transferable to other intersections, limiting their scalability.\u003c/p\u003e\u003cp\u003eThis study presents a novel and generalizable framework for BEV-based trajectory estimation from traffic CCTV footage, addressing the limitations of existing calibration-dependent and intersection-specific methods. The proposed approach introduces a custom-defined \"space\" estimation box that infers spatial relationships using visual cues such as vehicle headlights, tire positions, and shadows, thereby eliminating the need for intrinsic or extrinsic camera calibration. It integrates a customized YOLOv8-OBB detection model that operates directly on BEV-transformed images and maintains robust performance across varied intersection geometries. By linking detections over time, the framework reconstructs accurate vehicle trajectories and enables precise estimation of speed and acceleration in BEV space. This approach enhances scalability, reduces deployment complexity, and supports large-scale, proactive traffic safety analysis.\u003c/p\u003e"},{"header":"LITERATURE REVIEW","content":"\u003cp\u003eTraffic video footage has been widely explored for conflict analysis. One study used a clustering-based approach to identify and adapt conflicting trajectory pairs (\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e). 
Another developed an automated computer vision tool that significantly reduced rear-end, merging, and total conflicts (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e). A separate study combined UAV data with a convolutional neural network to improve road user detection, reporting substantial gains in accuracy (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e). These efforts have advanced video analytics in traffic conflict analysis and continue to guide research in this area. However, accurate trajectory estimation from standard CCTV footage remains an underexplored challenge, and addressing it could improve the reliability and scalability of conflict detection methods (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eMonocular 3D detection is widely studied for CCTV footage, but the absence of depth-sensing hardware or stereo vision remains a key limitation (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). Depth must be inferred indirectly using visual cues such as object scale, perspective distortion, motion patterns, and scene geometry (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e). Some methods enforce geometric consistency between 2D and 3D by aligning box dimensions and orientations (\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e), while others use monocular depth estimation networks to predict pixel-wise depth maps from single images (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e). 
These approaches, however, require 3D ground truth for supervised learning. Without it, models struggle to generalize, especially in complex urban scenes, and are further hindered by occlusion, lighting variation, and scene complexity (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e). CenterLoc3D is a 3D object localization framework that bypasses traditional 2D detection by directly predicting vehicle centroids, bounding box vertices, and 3D dimensions using a multi-scale fusion module and constrained loss function (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e). While accurate in real time, it depends on camera calibration, limiting use in uncalibrated traffic scenes. Another approach combines 3D box depth-ordering with an LSTM motion model for robust tracking but also requires stereo or calibrated sensors (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e). A separate study introduced a large-scale dataset to train Cube R-CNN, though it faced challenges with detection drift and orientation errors (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eUAV-based data collection offers a promising alternative to monocular 3D tracking by providing top-down views that eliminate perspective distortion and ensure consistent scale (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e, \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e). However, UAVs face limitations such as short battery life and FAA restrictions on flights over populated areas, making them impractical for continuous urban monitoring (\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e). 
These methods also rely on stable, high-altitude views that differ significantly from ground-mounted CCTV footage, which typically captures oblique angles. This geometric mismatch limits the direct applicability of UAV-trained models to CCTV data, which is more affected by occlusion, scale variation, and distortion (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eSeveral studies have attempted to convert CCTV footage into UAV-like top-down BEV using homography transformation, typically via vanishing point estimation or landmark-based methods. Vanishing points assume straight, unobstructed roads and vehicle-road alignment, making them unreliable in intersections or curved scenes (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e). Landmark-based methods align image features (e.g., crosswalks, stop lines) with reference imagery, offering better robustness in complex settings (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e). One study proposed tailed rotated box regression, linking vehicle base centers to extended BEV shapes using a \"tail\" vector, but it requires accurate 3D boxes, class-specific box dimensions, and customized YOLO outputs (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). Other studies combined vanishing points with homography, but struggled in complex traffic environments (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e, \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e). 
Another applied IMU-based motion correction with homography for front-view BEV projection, though it depends on sensor data and is limited to highway-like scenarios (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eA study proposed 3D-Net for efficient 3D object recognition, Multi-Class Object Tracking (MCOT) under occlusion, and Sliding Grid Inverse Perspective Mapping for automatic calibration to reduce distortion (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e). However, its tracker depends on class-based identity assignment, limiting generalizability. YOLO and CenterTrack were evaluated for road user detection and trajectory estimation from CCTV footage (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). CenterTrack, effective for low-elevation cameras (5–15 ft), jointly detects and tracks via inter-frame displacement prediction. YOLOv7 was preferred for higher viewpoints due to stronger detection. Speed and orientation were estimated using methods from (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e). However, CenterTrack’s performance degrades at elevated views, and YOLOv7 lacks built-in 3D detection or transformation, leading to inaccurate spatial estimates. BEV box orientations often misalign with vehicle paths, reducing reliability at intersections. 
Moreover, most existing studies rely on base YOLO variants, which do not support oriented bounding boxes, an important limitation for capturing complex vehicle maneuvers in intersection scenarios (\u003cspan additionalcitationids=\"CR40\" citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e–\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eDespite numerous studies on BEV transformation and trajectory estimation from CCTV footage, existing work often falls short when it comes to direct integration with widely used open-source detection models like YOLO. This lack of compatibility limits practical applicability and reduces scalability, making it challenging for broader adoption in real-world traffic analysis tasks.\u003c/p\u003e\n\n"},{"header":"DATA COLLECTION AND STUDY SITE","content":"\u003cp\u003eTraffic CCTV footage was collected from a public livestream of the Four Corners intersection in downtown Coldwater, Michigan, a signalized, high-volume site frequently used by large trucks and conveyors, offering diverse vehicle types for analysis (\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e). Its public availability supports transparency and reproducibility. Recordings were captured using OBS Studio from 6:00 AM to 6:00 PM on alternating days over one week at 30 fps, yielding 36 hours of footage. Selected frames were converted to BEV using Homography Transformation (HT) and balanced across vehicle classes to ensure robust model training. 
Annotation was performed using Roboflow for its intuitive interface, YOLO format support, and built-in tools for augmentation, versioning, and efficient dataset export (\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e).\u003c/p\u003e\u003ch3\u003eDataset for Multi-Class Road User Detection (MCRUD) from BEV Transformed Imagery\u003c/h3\u003e\u003cp\u003eTo address the lack of spatially grounded annotations in existing datasets, a custom BEV dataset was developed using CCTV video frames. Most public datasets are tailored to natural perspective views and do not capture the physical footprint of road users in top-down imagery, limiting their applicability for BEV-based detection and trajectory analysis. Additionally, prior studies often rely solely on class-based annotations (e.g., cars, pedestrians), without representing the actual space occupied, which is critical for trajectory reconstruction and interaction assessment. The developed dataset includes six classes: cars, trucks, bicycles, pedestrians, objects, and space. For each dynamic road user (car, truck, bicycle, pedestrian), two bounding boxes were annotated: (1) the object class and (2) the physical space it occupies. Static infrastructure elements (e.g., utility poles) were labeled as \"objects\" and paired with space boxes to ensure consistency in spatial representation. These \"space\" annotations are essential for accurate speed estimation, conflict analysis, and surrogate safety metrics. The final dataset contains 564 annotated images, within the recommended 500–1000 image range for object detection (\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e). 
It supports BEV-specific model development and enhances generalizability by reflecting a transferable spatial structure suitable for varied intersections and camera angles.\u003c/p\u003e"},{"header":"METHODOLOGY","content":"\u003cp\u003eThe following subsections outline key components of the proposed methodology. HT and Camera Calibration describes how raw CCTV footage is transformed into a top-down BEV view using point-based mapping with aerial imagery. It also details a manual process for estimating pixel-to-meter ratios, enabling real-world measurements without intrinsic or extrinsic camera parameters. Bounding Boxes for Defining Vehicle Space introduces a novel feature-guided method that estimates vehicle footprints in BEV images using cues such as tire positions, headlights, distortion slopes, and shadows, allowing adaptation to different vehicle orientations and movements beyond fixed-dimension bounding boxes. Object Detection details a custom-trained YOLOv8-OBB model designed to detect oriented vehicles directly in BEV space. Dynamic Object Tracking explains the use of ByteTrack to associate detections across frames and maintain consistent vehicle identities. Lastly, Trajectory Estimation and Refinement covers the extraction, smoothing, and refinement of vehicle paths to calculate motion parameters crucial for conflict and safety analysis.\u003c/p\u003e\n\u003ch3\u003eHomography Transformation and Camera Calibration\u003c/h3\u003e\n\u003cp\u003eTo generate the BEV transformation, a homography matrix was computed by manually selecting matching points between the original camera view and a reference top-down image (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u003cb\u003e(a)\u003c/b\u003e and \u003cb\u003e(b)\u003c/b\u003e). 
Using these point pairs, OpenCV\u0026rsquo;s cv2.findHomography() function estimated the projective transformation matrix H which maps coordinates from the camera view to BEV space via x\u0026prime;=Hx, where x and x\u0026prime; are corresponding points in the source and destination images, respectively (\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e). This matrix was applied to each video frame using cv2.warpPerspective() for consistent projection into BEV (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e(d)), with the same transformation applied across the sequence using a saved .npy file. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e(c) shows the aligned overlay of the CCTV and satellite images, confirming calibration accuracy.\u003c/p\u003e\u003cp\u003eA manual point-based calibration method was developed to estimate the pixel-to-meter ratio for real-world spatial analysis. Using a static CCTV frame, known-distance point pairs were annotated, and their pixel distances computed. A grid search (starting at 1.0, incremented by 0.1) identified the ratio minimizing Mean Absolute Error (MAE) between pixel and real-world distances. This approach enables accurate calibration without requiring intrinsic or extrinsic camera parameters, making it ideal for uncalibrated traffic surveillance footage.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\n\u003ch3\u003eBounding Boxes to Define Vehicle Space\u003c/h3\u003e\n\u003cp\u003eTo enhance spatial estimation accuracy, vehicle movements were categorized by directional orientation specifically north\u0026ndash;south and east\u0026ndash;west due to variation in BEV appearance caused by motion direction and camera angle. This direction-specific approach is especially useful at intersections, where maneuvers and perspective distortions vary. 
Tailoring estimation to each direction ensures that bounding boxes more accurately represent the space occupied by vehicles. Trajectories were then estimated by tracking the center point of this space box across frames, offering a more accurate reflection of actual vehicle presence than the visible footprint alone.\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e illustrates the step-by-step process for defining the spatial bounding box. The yellow line marks the reference edge, green lines guide distortion alignment, and blue lines indicate estimated vehicle width.\u003c/p\u003e\u003cp\u003eFor vehicles moving north or south in a straight path, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003cb\u003e(a)\u003c/b\u003e and \u003cb\u003e(b)\u003c/b\u003e, the process begins by drawing a horizontal line along the shadow edge and a vertical line between the two visible tires. Their intersection forms Point 1, the starting corner. Sloped lines are then drawn through each headlight, aligned with the vehicle\u0026rsquo;s distortion, forming Points 2 and 3 where they intersect the initial lines. A vertical line from Point 2 and a horizontal line from Point 3 intersect to form Point 4. Connecting Points 1, 2, 4, and 3 yields the final spatial bounding box, accurately capturing the vehicle\u0026rsquo;s distorted footprint in the BEV.\u003c/p\u003e\u003cp\u003eFor vehicles moving east or west in a straight path, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003cb\u003e(c)\u003c/b\u003e and \u003cb\u003e(d)\u003c/b\u003e, the spatial bounding box was constructed using reference lines and average vehicle dimensions. A horizontal line was drawn along the tire alignment and a vertical line along the shadow edge, intersecting at Point 1, the starting corner. 
An angled line following the vehicle\u0026rsquo;s distortion and passing through the visible headlight intersected the horizontal line to form Point 2. Using known or estimated vehicle width, perpendicular lines were extended from the axis defined by Points 1 and 2 to locate Points 3 and 4. Together, these four points defined the final spatial bounding box.\u003c/p\u003e\u003cp\u003eFor turning movements of vehicles going north or south (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003cb\u003e(e)\u003c/b\u003e and \u003cb\u003e(f)\u003c/b\u003e), the spatial bounding box was constructed using tire and shadow alignment. A line was drawn between the visible tires, and a perpendicular line along the shadow edge. Their intersection marked Point 1. Two diagonal lines were then traced along the outer edges of the vehicle\u0026rsquo;s distortion, intersecting the initial lines to form Points 2 and 3. Finally, a line parallel to the tire line through Point 2 and a line parallel to the shadow line through Point 3 intersected at Point 4. Connecting all four points completed the bounding box.\u003c/p\u003e\u003cp\u003eFor turning movements of vehicles going east or west (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003cb\u003e(g)\u003c/b\u003e and \u003cb\u003e(h)\u003c/b\u003e), the spatial bounding box was constructed using two perpendicular reference lines: one along the tire alignment and the other along the shadow edge. Their intersection marked Point 1. A diagonal line following the vehicle\u0026rsquo;s distortion slope and passing through the headlight intersected a reference line to form Point 2. 
The vehicle\u0026rsquo;s width was then estimated using top-down reference imagery and used to project Points 3 and 4 perpendicular to the line between Points 1 and 2, completing the bounding box.\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eDevelopment and Optimization of the Custom Object Detection Algorithm\u003c/h2\u003e\u003cp\u003eAccurate detection in BEV-transformed frames requires custom training, as most pre-trained models are optimized for natural perspective views. This study trained a detection model using feature-guided spatial bounding boxes tailored to BEV geometry, enabling more precise representation of vehicle footprints, an advancement overlooked in prior studies.\u003c/p\u003e\u003cp\u003eThe YOLOv8 architecture, whose oriented bounding box (OBB) variant was released in January 2024, was selected for its native OBB support, essential for capturing object orientation during turning maneuvers (\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e). YOLOv8-OBB was preferred over YOLOv11-OBB due to its proven stability, thorough benchmarking, strong documentation, and wide adoption at the time (\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e). In 2023 alone, over 19\u0026nbsp;million YOLOv8 models were trained, with 15\u0026nbsp;billion object detections recorded and 4\u0026nbsp;million downloads in December (\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e). The use of this YOLOv8-OBB model, specifically designed to handle BEV and aerial imagery, represents a key contribution of this paper, distinguishing it from existing works that did not leverage such an advanced architecture.\u003c/p\u003e\u003cp\u003eIn this study, the YOLOv8-OBB model was custom-trained on the BEV Road User Detection Dataset using Google Colab\u0026rsquo;s T4 GPU. 
Training ran for up to 500 epochs with early stopping applied after 50 epochs of no improvement, following best practices in recent YOLO research (\u003cspan additionalcitationids=\"CR49\" citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e). This patience setting helps prevent overfitting while maintaining computational efficiency (\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e).\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eDynamic Object Tracking and Trajectory Estimation\u003c/h3\u003e\n\u003cp\u003eThis study uses ByteTrack, a state-of-the-art multi-object tracking algorithm known for high accuracy and efficiency (\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e). By associating both high- and low-confidence detections, ByteTrack improves tracking continuity under occlusions and missed frames. Unlike SORT, which discards low-confidence boxes and suffers from identity switches, ByteTrack offers more stable tracking (\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e). Compared to DeepSORT, it avoids computationally expensive feature extraction while maintaining competitive accuracy. Its simplicity, compatibility with YOLO detectors, and low overhead make it suitable for real-time applications.\u003c/p\u003e\u003cp\u003eTo enhance reliability, trajectory points were sampled every 5th frame and smoothed using a moving average filter (window size\u0026thinsp;=\u0026thinsp;3). 
This configuration, chosen after testing intervals of 2\u0026ndash;10 frames and windows of 2\u0026ndash;6, balanced jitter reduction with preservation of motion detail crucial for accurate speed and acceleration estimation.\u003c/p\u003e\n\u003ch3\u003eCamera Calibration and Validation\u003c/h3\u003e\n\u003cp\u003eThe calibration technique in this study estimates the optimal pixel-to-meter (px/m) ratio by manually selecting known-distance reference points in a BEV-transformed image. Users define multiple reference lines by clicking two points and inputting their real-world distance (in meters). Pixel distances are then computed, and a range of px/m ratios is evaluated by calculating the Mean Absolute Error (MAE) between estimated and actual distances.\u003c/p\u003e\u003cp\u003eThe optimal ratio is identified as the value that minimizes overall MAE. In this case, an iterative search (starting at 1.0 px/m with 0.1 increments) yielded an optimal value of 19.2 px/m. Reference points, such as lane edges, were selected for their visibility and consistency across satellite and BEV images, minimizing alignment errors. Calibration accuracy was further assessed by computing percentage errors, as reported in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003cp\u003eTo validate the accuracy of the calibrated BEV-transformed images, a quantitative analysis was performed using ground-truth distances from a satellite image. Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e lists the reference lines with their real-world lengths and corresponding measurements in the BEV image. 
The percentage error for each line was calculated using the formula:\u003c/p\u003e\u003cp\u003e% Error = \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\frac{\\text{Dimension in BEV Perspective} - \\text{Dimension in Satellite Image}}{\\text{Dimension in Satellite Image}} \\times 100\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003cp\u003eThe results show an average percentage error of 0.84%, with a maximum deviation of only 1.67%, well within the commonly accepted 2% threshold for homography-based spatial calibration (\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e). This confirms that the BEV transformation maintains geometric fidelity, supporting reliable estimation of speed, acceleration, and vehicle spacing.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePercentage Error Between Satellite and BEV Image Measurements\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eLines\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e\u003cp\u003eDimension\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e% Error\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSatellite (True)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eBEV (Observed)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAB\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e38.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e38.76\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.68\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCD\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e38.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e38.91\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.54\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEF\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e12.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e12.24\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.33\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGH\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e10.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c3\"\u003e\u003cp\u003e10.37\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1.67\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGI\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e33.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e34.13\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.98\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTo further validate the transformation, vehicle dimensions estimated from BEV frames were compared to real-world specifications. Typical SUVs measure 4.8\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15 m (length) and 1.9\u0026thinsp;\u0026plusmn;\u0026thinsp;0.1 m (width), while sedans are around 4.4\u0026thinsp;\u0026plusmn;\u0026thinsp;0.2 m by 1.75\u0026thinsp;\u0026plusmn;\u0026thinsp;0.08 m. A two-sample t-test showed no significant difference between BEV-estimated and actual dimensions (t(8, 8)\u0026thinsp;=\u0026thinsp;0.00, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01), confirming that the spatial bounding boxes reliably capture real-world vehicle sizes.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eModel Training, Validation and Detection Performance\u003c/h2\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e\u003cb\u003e(a)\u003c/b\u003e presents training and validation plots for the YOLOv8-Oriented Bounding Box (OBB) model over ~230 epochs. 
The top row shows training metrics, while the bottom row displays validation performance, offering a complete view of model learning in BEV-transformed images. The model showed consistent convergence across all key loss components: box loss (bounding box accuracy), classification loss (label prediction), and Distribution Focal Loss (DFL), which improves boundary precision through continuous label distributions. Box loss decreased from 2.5 to below 0.5, classification loss from above 3.0 to 0.3, and DFL from 2.9 to 1.5, after which all three stabilized, indicating healthy learning progression. Validation losses closely followed training trends, indicating good generalization with no signs of overfitting.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eModel effectiveness was further validated using standard evaluation metrics. Precision improved from 50% to over 95%, and recall from 25% to above 97%, indicating fewer false positives and substantially improved detection coverage. The model achieved near-perfect detection accuracy, with mAP@50 reaching 99% and mAP@50–95 reaching 92%, demonstrating strong performance under both lenient and strict IoU thresholds.\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e \u003cb\u003e(b)\u003c/b\u003e and \u003cb\u003e(c)\u003c/b\u003e show the confusion matrices for multi-class classification on the BEV-transformed dataset. The raw matrix reflects absolute counts, while the normalized matrix shows per-class accuracy. The model correctly classified 218 cars, 239 space instances, and 67 objects. Misclassifications were few: for example, 11 space instances were mislabeled as background and 13 as trucks. A few rare errors occurred between pedestrians and trucks. The normalized matrix shows near-perfect accuracy (1.00) for most classes, including bicycle, car, object, and pedestrian. The space class, critical for trajectory estimation, was classified correctly 98% of the time. 
Remaining confusion, such as trucks misidentified as space (48%) and 2% of space instances misidentified as background, likely stems from overlapping footprints in BEV views.\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e illustrates temporal consistency and variability in detection confidence across frames in a multi-class traffic scene. The model accurately detects all annotated classes: cars, trucks, bicycles, pedestrians, static objects, and spatial extents of moving vehicles. One clear trend observed in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e\u003cb\u003e(b)\u003c/b\u003e is that confidence scores remain low when vehicles are partially visible (e.g., 0.43 and 0.47) and rise above 0.85 once the vehicle fully enters the frame. This suggests the model relies on complete visual cues like headlights, tires, and contours for accurate space detection. This dependency highlights the importance of full vehicle visibility for reliable trajectory initialization, conflict analysis, and real-time monitoring.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e\u003cb\u003e(b)\u003c/b\u003e and \u003cb\u003e(d)\u003c/b\u003e reveal that pedestrian and bicycle classes consistently receive lower confidence scores than larger vehicles. This is due to their smaller size in BEV frames, which yields fewer pixels per object, and to their underrepresentation in the training dataset. These findings highlight the model’s strength in detecting fully visible, larger road users, while also identifying the need for improved early detection and better class balance during training. 
Though occasional misclassifications occur between visually similar classes, they are rare and do not significantly impact tasks like trajectory estimation or safety analysis.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003eTrajectory Analysis for Speed and Acceleration Estimation\u003c/h2\u003e\u003cp\u003eTrajectory estimation is essential for deriving motion characteristics like speed and acceleration, which underpin conflict analysis and behavioral assessment in traffic scenes. However, raw trajectories often exhibit jitter, noise, and orientation errors, especially during turning, due to occlusions, oblique views, and inconsistent bounding box centers.\u003c/p\u003e\u003cp\u003eTo address these issues, a three-step post-processing framework was implemented. First, trajectory points were sampled every 5th frame to reduce noise and suppress transient errors. Second, a moving average filter (window size = 3) was applied to smooth the path while preserving directional changes. Third, orientation correction was performed by calculating the trajectory slope and aligning the bounding box angle with the resulting direction of travel. This process significantly improved the temporal and spatial consistency of trajectories, enabling accurate speed and acceleration estimation. Frame-by-frame analysis of speed and acceleration, commonly used in conflict studies, serves as a practical validation method where ground-truth dynamics are unavailable, helping assess the realism of derived motion profiles.\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e presents speed and acceleration profiles of three vehicles engaged in distinct maneuvers: (a) a straight-moving vehicle, (b) a left-turning vehicle, and (c) a right-turning vehicle. 
The data, sampled every 10th frame, reflect realistic driving behavior and serve as a qualitative validation of the computed motion parameters.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe straight-moving vehicle (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e \u003cb\u003e(a)\u003c/b\u003e) shows a sharp increase in speed to more than 40 mph, with a peak acceleration of nearly 8 mph², followed by a drop in acceleration to zero, indicating the vehicle reached cruising speed. This pattern aligns with typical behavior after crossing an intersection or merging into faster traffic. In contrast, the left-turning (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e\u003cb\u003e(b)\u003c/b\u003e) and right-turning (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e\u003cb\u003e(c)\u003c/b\u003e) vehicles reach lower peak speeds of ~23 mph and ~17 mph, respectively, consistent with cautious turning behavior. Their acceleration profiles are more moderate (2.5–3.5 mph²) and exhibit smoother fluctuations across frames, reflecting gradual throttle adjustments during curved movements.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003eGeneralizability of Vision-Based Trajectory Estimation Across Diverse Intersections\u003c/h2\u003e\u003cp\u003eTo evaluate the generalization capability of our proposed detection and tracking framework, we tested the YOLOv8-OBB-based model on a completely disjoint and previously unseen dataset, which contained no images from the original MCRUD dataset. This secondary dataset was captured at a different intersection located on Durango Drive in Nevada. The site featured a different scene composition, camera angle, and mounting height, resulting in noticeably different BEV distortions. 
Testing on this dataset allowed us to assess the framework’s adaptability to varied spatial configurations, a critical but often overlooked aspect in prior studies.\u003c/p\u003e\u003cp\u003eAlthough the performance scores are lower than those obtained on the MCRUD dataset (precision of 95%, recall of 97%, mAP@0.5 of 99%, and mAP@0.5:0.95 of 92%), the model still performs reasonably well on this disjoint dataset, attaining a precision of 86%, recall of 82%, mAP@0.5 of 78%, and mAP@0.5:0.95 of 71%. The drop in accuracy can be attributed to several factors: vehicles in the Durango Drive footage appear significantly smaller due to the higher camera position, and both lighting variation and scene-specific distortion affected detection accuracy. Despite these challenges, the model demonstrated strong generalization performance. It is generally accepted in object detection literature that an mAP@0.5 above 70% is considered good, and an mAP@0.5:0.95 above 50% indicates robust performance across varying intersection-over-union thresholds (\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e). Therefore, the reported results underscore the model's reliability even in unseen and visually different environments.\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e illustrates detection outputs from the proposed YOLOv8-OBB model on several frames of the disjoint test dataset from the Durango Drive intersection. 
In Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u003cb\u003e(a)\u003c/b\u003e through \u003cb\u003e(d)\u003c/b\u003e, the model demonstrates high confidence in detecting vehicles such as cars and trucks, with confidence scores consistently exceeding 0.90, indicating robust performance even in a visually different environment.\u003c/p\u003e\u003cp\u003eNotably, the detection of a pedestrian (visible in frames b and c) highlights the model’s dynamic adaptation. Initially, the pedestrian is either missed (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u003cb\u003e(a)\u003c/b\u003e) or assigned a low confidence score of 0.30 (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u003cb\u003e(b)\u003c/b\u003e), primarily due to the rarity of pedestrian instances in the training set and the tendency of the model to confuse small static objects like pedestrians with crosswalk markings. However, as the pedestrian continues to move across subsequent frames, the model correctly identifies the object and increases its confidence to 0.81 (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u003cb\u003e(c)\u003c/b\u003e). This progression underscores the value of motion cues in distinguishing dynamic objects from static background elements.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"DISCUSSION AND CONCLUSION","content":"\u003cp\u003eThis study introduced a comprehensive vision-based framework for detecting, tracking, and estimating vehicle trajectories using BEV-transformed traffic camera footage. One of the core contributions lies in developing a YOLOv8-OBB-based detection model that operates directly on BEV images without requiring architectural modification. 
Unlike prior approaches that relied on complex 2D-to-3D conversions or perspective transformations tied to calibrated views, the proposed method streamlines BEV detection as a plug-and-play solution, compatible with standard object detection pipelines. This not only enhances scalability but also opens up BEV-based traffic analysis to a wider range of researchers and practitioners without the need for in-depth geometric modeling expertise.\u003c/p\u003e\u003cp\u003eThe novel space-oriented bounding box, derived using interpretable vehicle features such as headlights, tires, and geometric distortions, enabled the system to estimate the actual occupied space more precisely than conventional bounding boxes. This spatial accuracy is crucial for realistic trajectory estimation, particularly in BEV scenes where objects exhibit non-standard shapes and orientations due to distortion. Training and validation results confirmed the robustness of the model, which achieved high performance across all major detection metrics: 95% precision, 97% recall, 99% mAP@0.5, and 92% mAP@0.5:0.95 on the primary MCRUD dataset. The model showed excellent convergence behavior and strong generalization to validation data, supported by aligned loss curves and high classification accuracy across most classes.\u003c/p\u003e\u003cp\u003eIn real-world traffic environments, smooth and physically consistent trajectories are essential for applications such as speed estimation, conflict analysis, and risk-based traffic management. The proposed trajectory post-processing framework, based on temporal sampling, moving average smoothing, and orientation correction, successfully mitigated frame-to-frame jitter and instability during turning maneuvers. The resulting speed and acceleration profiles captured the distinct motion characteristics of straight-moving versus turning vehicles, with values aligning well with expected real-world driving behavior. 
These dynamics serve as both validation of the motion pipeline and a foundation for future integration into conflict detection algorithms.\u003c/p\u003e\u003cp\u003ePerhaps most importantly, the framework was evaluated on a completely disjoint intersection dataset with different camera angles, heights, and visual context from the training data. Although accuracy was lower on this dataset (precision: 86%, recall: 82%, mAP@0.5: 78%, mAP@0.5:0.95: 71%), these values still meet widely accepted thresholds in object detection research. The performance drop was primarily due to smaller vehicle sizes, lighting variation, and higher camera elevation, which affected the fidelity of visual cues in BEV space. Notably, the model’s detection improved dynamically over time for rare and underrepresented classes such as pedestrians, showing increased confidence as motion cues became available. This highlights the model’s ability to adapt based on temporal context, an important trait for real-time monitoring systems.\u003c/p\u003e\u003cp\u003eThe proposed framework demonstrates strong potential for real-world implementation, offering a calibration-free and scalable solution for vehicle detection and trajectory estimation using existing CCTV infrastructure. Its ability to generalize across varied camera setups makes it suitable for large-scale deployment by transportation agencies and ITS centers. By accurately capturing vehicle occupancy and motion dynamics in BEV space, the framework enables proactive safety analysis, near-miss detection, and data-driven intersection design. This study advances vision-based traffic analysis by bridging deep learning detection methods with practical transportation applications. While the model performed robustly, future work should focus on expanding the training dataset to include varied intersection geometries, camera perspectives, and lighting conditions. 
Enhancing the representation of pedestrians and cyclists, along with exploring domain adaptation techniques, will further improve model generalizability and support broader applicability in real-world contexts.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eFUNDING\u003c/h2\u003e\u003cp\u003eThe authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the University of North Dakota (UND) Early Career Award program (grant no. U31770-2710-1289).\u003c/p\u003e\u003ch2\u003eAUTHOR CONTRIBUTIONS\u003c/h2\u003e\u003cp\u003eThe authors confirm contribution to the paper as follows: study conception and design: Gaweesh and Abdelhadi; data collection: Roy; analysis and interpretation of results: Roy and Gaweesh; draft manuscript preparation: Roy, Gaweesh, and Abdelhadi. All authors reviewed the results and approved the final version of the manuscript.\u003c/p\u003e\u003ch2\u003eACKNOWLEDGMENT\u003c/h2\u003e\u003cp\u003eThe authors gratefully acknowledge the University of North Dakota (UND) for supporting this research through the Early Career Faculty Award program, which made this work possible.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAl-Omari MMA, Abdel-Aty M (2023) Evaluation of Driving Behavior and Traffic Safety at a Shifting Movements Intersection. Transp Res Rec 2677(1):1228\u0026ndash;1242. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/03611981221103865\u003c/span\u003e\u003cspan address=\"10.1177/03611981221103865\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu Y, Alsaleh R, Sayed T (2024) Modelling Motorized and Non-Motorized Vehicle Conflicts Using Multiagent Inverse Reinforcement Learning Approach. Transportmetrica B: Transport Dynamics 12(1). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/21680566.2024.2314762\u003c/span\u003e\u003cspan address=\"10.1080/21680566.2024.2314762\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang Y, Liu L, Zhu T (2022) Extracting Traffic Conflict at Urban Intersection Using Deep Learning Trajectory Detection. \u003cem\u003eInternational Conference on Autonomous Unmanned Systems\u003c/em\u003e. Springer\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eReyad P, Sayed T, Essa M, Zheng L (2022) Real-Time Crash-Risk Optimization at Signalized Intersections. Transp Res Rec 2676(12):32\u0026ndash;50. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/03611981211062891\u003c/span\u003e\u003cspan address=\"10.1177/03611981211062891\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGaweesh SM, Ahmed I, Ahmed MM (2023) Analysis Framework to Assess Crash Severity for Large Trucks on Rural Interstate Roads Utilizing the Latent Class and Random Parameter Model. Transp Res Rec 2677(9):130\u0026ndash;150. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/03611981231158627\u003c/span\u003e\u003cspan address=\"10.1177/03611981231158627\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGaweesh SM, Bakhshi AK, Ahmed MM (2021) Safety Performance Assessment of Connected Vehicles in Mitigating the Risk of Secondary Crashes: A Driving Simulator Study. Transp Res Rec 2675(12):117\u0026ndash;129. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/03611981211027881\u003c/span\u003e\u003cspan address=\"10.1177/03611981211027881\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWu Y, Abdel-Aty M, Zheng O, Cai Q, Zhang S (2020) Automated Safety Diagnosis Based on Unmanned Aerial Vehicle Video and Deep Learning Algorithm. Transp Res Rec 2674(8):350\u0026ndash;359. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/0361198120925808\u003c/span\u003e\u003cspan address=\"10.1177/0361198120925808\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYao R, Zeng W, Chen Y, He Z (2021) A Deep Learning Framework for Modelling Left-Turning Vehicle Behaviour Considering Diagonal-Crossing Motorcycle Conflicts at Mixed-Flow Intersections. Transp Res Part C: Emerg Technol 132. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.TRC.2021.103415\u003c/span\u003e\u003cspan address=\"10.1016/J.TRC.2021.103415\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZheng L, Fan S, Ma S, Jiao H (2024) Multi-Type Traffic Conflict Identification at Signalized Intersections Based on LiDAR Point Cloud. Transp Res Rec, Vol. 2024, No. 10, pp. 916\u0026ndash;925. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/03611981241235178\u003c/span\u003e\u003cspan address=\"10.1177/03611981241235178\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOh J, Min J, Kim M, Cho H (2009) Development of an Automatic Traffic Conflict Detection System Based on Image Tracking Technology. Transp Res Rec 2129:45\u0026ndash;54. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3141/2129-06\u003c/span\u003e\u003cspan address=\"10.3141/2129-06\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSayed T, Ismail K, Zaki M, Autey J (2012) Feasibility of Computer Vision-Based Safety Evaluations. Transp Res Rec 2280:18\u0026ndash;27. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3141/2280-03\u003c/span\u003e\u003cspan address=\"10.3141/2280-03\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSaunier N, Sayed T (2007) Automated Analysis of Road Safety with Video Data. Transp Res Rec 2019:57\u0026ndash;64. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3141/2019-08\u003c/span\u003e\u003cspan address=\"10.3141/2019-08\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSt-Aubin P, Saunier N, Miranda-Moreno L (2015) 
Large-Scale Automated Proactive Road Safety Analysis Using Video Data. Transp Res Part C: Emerg Technol 58:363\u0026ndash;379. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.trc.2015.04.007\u003c/span\u003e\u003cspan address=\"10.1016/j.trc.2015.04.007\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMohamed A, Li L, Ahmed MM (2023) \u003cem\u003eAutomated Traffic Safety Assessment Tool Utilizing Monocular 3-D Convolutional Neural Network-Based Detection Algorithm at Signalized Intersections\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKronprasert N, Sutheerakul C, Satiennam T, Luathep P (2021) Intersection Safety Assessment Using Video-Based Traffic Conflict Analysis: The Case Study of Thailand. Sustain (Switzerland) 13(22). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/su132212722\u003c/span\u003e\u003cspan address=\"10.3390/su132212722\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMishra A, Chen K, Poddar S, Posadas E, Rangarajan A, Ranka S (2022) Using Video Analytics to Improve Traffic Intersection Safety and Performance. Vehicles 4(4):1288\u0026ndash;1313. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/vehicles4040068\u003c/span\u003e\u003cspan address=\"10.3390/vehicles4040068\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXie E, Yu Z, Zhou D, Philion J, Anandkumar A, Fidler S, Luo P, Alvarez JM \u003cem\u003eM2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird\u0026rsquo;s-Eye View Representation\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhu M, Zhang S, Zhong Y, Lu P, Peng H, Lenneman J (2021) 
Monocular 3D Vehicle Detection Using Uncalibrated Traffic Cameras through Homography\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRashed H, Essam M, Mohamed M, Sallab AE, Yogamani S (2021) BEV-MODNet: Monocular Camera Based Bird\u0026rsquo;s Eye View Moving Object Detection for Autonomous Driving\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHe T, Soatto S (2019) Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors. \u003cem\u003eProceedings of the AAAI Conference on Artificial Intelligence\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTang X, Wang W, Song H, Zhao C (2023) CenterLoc3D: Monocular 3D Vehicle Localization Network for Roadside Surveillance Cameras. Complex Intell Syst 9(4):4349\u0026ndash;4368. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s40747-022-00962-9\u003c/span\u003e\u003cspan address=\"10.1007/s40747-022-00962-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu W, Li Q, Yang W, Cai J, Yu Y, Ma Y, He S, Pan J (2022) Monocular BEV Perception of Road Scenes via Front-to-Top View Projection\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi J, Chen S, Zhang F, Li E, Yang T, Lu Z (2019) An Adaptive Framework for Multi-Vehicle Ground Speed Estimation in Airborne Videos. Remote Sens 11(10). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/RS11101241\u003c/span\u003e\u003cspan address=\"10.3390/RS11101241\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang L, Yu X, Daud A, 
Mussah AR, Adu-Gyamfi Y \u003cem\u003eApplication of 2D Homography for High Resolution Traffic Data Collection Using CCTV\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYang G, Ahmed M, Gaweesh S, Adomah E (2020) Connected Vehicle Real-Time Traveler Information Messages for Freeway Speed Harmonization under Adverse Weather Conditions: Trajectory Level Analysis Using Driving Simulator. Accid Anal Prev 146. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.aap.2020.105707\u003c/span\u003e\u003cspan address=\"10.1016/j.aap.2020.105707\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eQin Z, Wang J, Lu Y (2019) MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization. \u003cem\u003eProceedings of the AAAI Conference on Artificial Intelligence\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJia J, Li Z, Shi Y \u003cem\u003eMonoUNI: A Unified Vehicle and Infrastructure-Side Monocular 3D Object Detection Network with Sufficient Depth Clues\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePalffy A, Pool E, Baratam S, Kooij JFP, Gavrila DM \u003cem\u003eMulti-Class Road User Detection with 3\u0026thinsp;+\u0026thinsp;1D Radar in the View-of-Delft Dataset\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCarta S, Castrill\u0026oacute;n-Santana M, Marras M, Mohamed S, Podda AS, Saia R, Sau M, Zimmer W (2024) 
RoadSense3D: A Framework for Roadside Monocular 3D Object Detection\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eIbrahim MR, TopView (2024) Vectorising Road Users in a Bird\u0026rsquo;s Eye View from Uncalibrated Street-Level Imagery with Deep Learning. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s00521-025-11152-2\u003c/span\u003e\u003cspan address=\"10.1007/s00521-025-11152-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRezaei M, Azarmi M (2023) Mir. 3D-Net: Monocular 3D Object Recognition for Traffic Monitoring. Expert Syst Appl 227. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.eswa.2023.120253\u003c/span\u003e\u003cspan address=\"10.1016/j.eswa.2023.120253\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKrajewski R, Bock J, Kloeker L, Eckstein L (2018) The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems. \u003cem\u003eIEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC\u003c/em\u003e, Vol. 2018-November, pp. 2118\u0026ndash;2125. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ITSC.2018.8569552\u003c/span\u003e\u003cspan address=\"10.1109/ITSC.2018.8569552\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHu H-N, Cai Q-Z, Wang D, Lin J, Sun M, Kr\u0026auml;henb\u0026uuml;hl P, Darrell T, Yu F \u003cem\u003eJoint Monocular 3D Vehicle Detection and Tracking\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBarmpounakis E, Geroliminis N (2020) On the New Era of Urban Traffic Monitoring with Massive Drone Data: The pNEUMA Large-Scale Field Experiment. Transp Res Part C: Emerg Technol 111:50\u0026ndash;71. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.TRC.2019.11.023\u003c/span\u003e\u003cspan address=\"10.1016/J.TRC.2019.11.023\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKhan, M., W. Ectors, \u0026hellip; T. B.-T., and undefined 2017. Unmanned Aerial Vehicle\u0026ndash;Based Traffic Analysis: Methodological Framework for Automated Multivehicle Trajectory Extraction.\u003cem\u003ejournals.sagepub.comMA Khan, W Ectors, T Bellemans, D Janssens, G WetsTransportation research record, 2017\u0026bull;journals.sagepub.com\u003c/em\u003e, Vol. 2626, 2017, pp. 25\u0026ndash;33. https://doi.org/10.3141/2626-04\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYohannes E, Lin CY, Shih TK, Thaipisutikul T, Enkhbat A (2023) Utaminingrum. An Improved Speed Estimation Using Deep Homography Transformation Regression Network on Monocular Videos. IEEE Access 11:5955\u0026ndash;5965. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ACCESS.2023.3236512\u003c/span\u003e\u003cspan address=\"10.1109/ACCESS.2023.3236512\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKim Y, Kum D (2019) \u003cem\u003eDeep Learning Based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCoifman B, Li L (2017) A Critical Evaluation of the Next Generation Simulation (NGSIM) Vehicle Trajectory Dataset. Transp Res Part B: Methodological 105:362\u0026ndash;377. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.TRB.2017.09.018\u003c/span\u003e\u003cspan address=\"10.1016/J.TRB.2017.09.018\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSoma K, Shibu L, Meenakshi N A Real-Time Vehicle Detection and Speed Estimation Using YOLO V8. (2024) \u003cem\u003eInternational Conference on Advances in Data Engineering and Intelligent Computing Systems, ADICS 2024\u003c/em\u003e, 2024. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ADICS58448.2024.10533551\u003c/span\u003e\u003cspan address=\"10.1109/ADICS58448.2024.10533551\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eIslam N, Ray SK, Hossain MA, Rashidul Hasan MAFM, Alamin and M. B. Al Zabir Shammo. Vehicle Classification and Detection Using YOLOv8: A Study on Highway Traffic Analysis. \u003cem\u003eInternational Conference on Recent Progresses in Science, Engineering and Technology, ICRPSET\u003c/em\u003e (2024), 2024., 2024. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ICRPSET64863.2024.10955913\u003c/span\u003e\u003cspan address=\"10.1109/ICRPSET64863.2024.10955913\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMaity S, Chakraborty A, Singh PK, Sarkar R (2023) Performance Comparison of Various YOLO Models for Vehicle Detection: An Experimental Study. \u003cem\u003eLecture Notes in Networks and Systems\u003c/em\u003e, Vol. 787 LNNS, pp. 677\u0026ndash;684. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/978-981-99-6550-2_50\u003c/span\u003e\u003cspan address=\"10.1007/978-981-99-6550-2_50\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMohamed A, Ahmed M (2024) Towards Rapid Safety Assessment of Signalized Intersections: An in-Depth Comparison of Computer Vision Algorithms. Adv Transp Stud 4:101\u0026ndash;116. No. Special issue\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.53136/97912218167168\u003c/span\u003e\u003cspan address=\"10.53136/97912218167168\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDwyer B, Nelson J, Hansen TROBOFLOW \u003cem\u003eRoboflow (Version 1.0)\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://roboflow.com./\u003c/span\u003e\u003cspan address=\"https://roboflow.com./\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLevi K, Weiss Y (2004) Learning Object Detection from a Small Number of Examples: The Importance of Good Features. 
\u003cem\u003eProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition\u003c/em\u003e, Vol. 2. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/CVPR.2004.1315144\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2004.1315144\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOpenCV (2025) OpenCV Modules. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://docs.opencv.org/4.x/index.html\u003c/span\u003e\u003cspan address=\"https://docs.opencv.org/4.x/index.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed Apr. 22\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eV8.1 (2025) 0 Release - YOLOv8 Oriented Bounding Boxes (OBB) \u0026middot; Ultralytics \u0026middot; Discussion #7472 \u0026middot; GitHub. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/orgs/ultralytics/discussions/7472\u003c/span\u003e\u003cspan address=\"https://github.com/orgs/ultralytics/discussions/7472\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed Apr. 22\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMulti-Object Tracking with Ultralytics YOLO - Ultralytics YOLO Docs (2025) \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://docs.ultralytics.com/modes/track/\u003c/span\u003e\u003cspan address=\"https://docs.ultralytics.com/modes/track/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed Apr. 22\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYe Z, Zhang H, Gu J, Li X (2023) YOLOv7-3D: A Monocular 3D Traffic Object Detection Method from a Roadside Perspective. Appl Sci (Switzerland) 13(20). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/app132011402\u003c/span\u003e\u003cspan address=\"10.3390/app132011402\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFarhat W, Ben Rhaiem O, Faiedh H, Souani C (2025) YOLO-TSR: A Novel YOLOv8-Based Network for Robust Traffic Sign Recognition. Transp Res Rec. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/03611981251327213\u003c/span\u003e\u003cspan address=\"10.1177/03611981251327213\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJiao J, Wang H (2022) Traffic Behavior Recognition from Traffic Videos under Occlusion Condition: A Kalman Filter Approach. Transp Res Rec 2676(7):55\u0026ndash;65. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/03611981221076426\u003c/span\u003e\u003cspan address=\"10.1177/03611981221076426\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAn Introduction to BYTETrack (2025) Multi-Object Tracking by Associating Every Detection Box | Datature Blog. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://datature.io/blog/introduction-to-bytetrack-multi-object-tracking-by-associating-every-detection-box\u003c/span\u003e\u003cspan address=\"https://datature.io/blog/introduction-to-bytetrack-multi-object-tracking-by-associating-every-detection-box\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed July 14\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKamil DA, Wahyono A, Harjoko (2024) Jo. Vehicle Speed Estimation Using Consecutive Frame Approaches and Deep Image Homography for Image Rectification on Monocular Videos. IEEE Access. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ACCESS.2024.3508135\u003c/span\u003e\u003cspan address=\"10.1109/ACCESS.2024.3508135\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYin F, Makris D, Velastin SA, Ellis T (2015) Calibration and Object Correspondence in Camera Networks with Widely Separated Overlapping Views. IET Comput Vision 9(3):354\u0026ndash;367. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1049/IET-CVI.2013.0301\u003c/span\u003e\u003cspan address=\"10.1049/IET-CVI.2013.0301\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"University of North Dakota","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":true,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Trajectory Estimation, Traffic Safety Analysis, YOLOv8-Oriented Bounding Box (OBB), Vehicle Detection and Tracking, Intelligent Transportation Systems (ITS)","lastPublishedDoi":"10.21203/rs.3.rs-7916322/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7916322/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe lack of scalable and cost-effective methods for extracting actionable vehicle trajectories from existing traffic CCTV infrastructure limits proactive traffic safety analysis. Traditional trajectory estimation approaches often rely on LiDAR, radar, or calibrated camera systems, which are costly and lack scalability. 
This study introduces a novel, plug-and-play framework for vision-based vehicle trajectory estimation using monocular CCTV footage, eliminating the need for camera calibration. The proposed system combines a homography-based Bird\u0026rsquo;s Eye View (BEV) transformation with a You Only Look Once (YOLO) v8 Oriented Bounding Box (OBB) detector, trained on a custom dataset, to estimate vehicle trajectories from traffic footage. The framework introduces a novel custom-defined \u0026ldquo;space\u0026rdquo; bounding box that accurately captures the physical footprint of moving objects. It leverages visual cues, such as tire shadows and distortion patterns, to address challenges related to occlusion and distortion. The YOLOv8-OBB model, trained on the compiled dataset, achieves a mean Average Precision (mAP@50\u0026ndash;95) of 0.92, with precision and recall both exceeding 0.95. Trajectories were refined through temporal sub-sampling, moving average smoothing, and slope-based orientation correction, resulting in stable and physically realistic paths even during turns and visual occlusions. Speed and acceleration profiles calculated from the refined trajectories align with real-world driving behavior, further validating the system\u0026rsquo;s accuracy. The pipeline was successfully tested on an unseen intersection, demonstrating its generalizability across varied traffic geometries and perspectives. 
This work presents a scalable, calibration-free solution for trajectory-based traffic monitoring, with potential applications in conflict detection, traffic modeling, and intersection safety assessments using widely available surveillance infrastructure.\u003c/p\u003e","manuscriptTitle":"A Generalizable Vision-Based Framework for Vehicle Trajectory Estimation and Conflict Analysis at Intersections","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-10-23 05:10:34","doi":"10.21203/rs.3.rs-7916322/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"11fd5e32-e973-46b8-9f93-55bcbcf94b83","owner":[],"postedDate":"October 23rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":56675292,"name":"Civil Engineering"},{"id":56675293,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-10-23T05:10:34+00:00","versionOfRecord":[],"versionCreatedAt":"2025-10-23 05:10:34","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7916322","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7916322","identity":"rs-7916322","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

