Training custom machine learning models for specific business needs requires high-quality data, often sourced from human annotations. However, these annotations are error-prone, especially for video. Uber Engineering has developed an ML-based system that detects bounding box annotation errors, with the aim of ensuring data integrity before the data is fed into model training.
Visual TL;DR: Manual annotation errors are costly to review and lead to inconsistencies, which motivates Uber’s ML solution. The solution integrates with uLabel, handles tricky video segments, and relies on synthetic data. Together these enable accurate ML training and, in turn, improved model quality.
Manual annotation errors: Human annotators make mistakes when labeling bounding boxes in video.
Expensive and inconsistent: Manual reviews double the cost, double the time, and lack consistency.
Uber’s ML solution: An ML system automatically detects bounding box errors and flags them for correction.
uLabel Integration: Solution integrated with in-house annotation tool uLabel
Tricky video segments: Challenges arise when recombining video segments after annotation
Synthetic data: Use synthetic data for robust error detection
Accurate ML training: Ensure data integrity for high-quality ML models.
Improving model quality: Improving the performance and reliability of trained ML models
The challenge lies in video annotation. Long footage is divided into segments for operators, which introduces the possibility of mistakes when the segments are recombined. Traditional human review workflows are costly and inconsistent. Uber’s solution is integrated with its in-house tool, uLabel, and provides real-time automated verification.
Problems with manual review
Human annotators can make mistakes. A second pair of eyes would help, but it would double the cost and time, which makes this workflow inefficient for large projects.
Uber’s ML-powered solution
Uber’s system automatically detects critical annotation errors such as ID swaps (a track ID incorrectly attached to a different object) and position jumps (unexplained shifts in coordinates). According to the Uber Engineering blog, these are the most common and impactful failures.
Why is it tricky?
Detecting these errors is not easy. What looks like an error in one context may be normal in another. Object size, motion, camera movement, scene complexity, and even frame rate all affect what counts as an anomaly. A 10-pixel shift is negligible for a car but significant for a distant pedestrian.
Fixed rules such as “flag jumps over X pixels” are insufficient because they cannot adapt to changing conditions.
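To make the car-versus-pedestrian point concrete, here is a minimal sketch that expresses a shift relative to the size of the box rather than in raw pixels. The function names, the box format (x_min, y_min, x_max, y_max), and the choice of normalizing by the box diagonal are illustrative assumptions, not details taken from Uber’s system.

```python
from math import hypot

def center(box):
    """Return the (x, y) center of a box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def relative_shift(prev_box, curr_box):
    """Center displacement divided by the diagonal of the previous box.

    Assumption for illustration: the diagonal is used as a rough proxy
    for apparent object size.
    """
    (px, py), (cx, cy) = center(prev_box), center(curr_box)
    diag = hypot(prev_box[2] - prev_box[0], prev_box[3] - prev_box[1])
    return hypot(cx - px, cy - py) / diag if diag > 0 else float("inf")

# Both objects move exactly 10 pixels to the right between two frames.
car_prev, car_curr = (100, 100, 500, 400), (110, 100, 510, 400)   # 400x300 box
ped_prev, ped_curr = (300, 200, 312, 216), (310, 200, 322, 216)   # 12x16 box

car_shift = relative_shift(car_prev, car_curr)
ped_shift = relative_shift(ped_prev, ped_curr)

print(f"car: {car_shift:.2f}")         # car: 0.02
print(f"pedestrian: {ped_shift:.2f}")  # pedestrian: 0.50
```

The same 10-pixel move is 2% of the car’s diagonal but 50% of the pedestrian’s, so any single pixel threshold either misses the pedestrian error or floods reviewers with false alarms on the car.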
Architecture for accuracy
The validation pipeline uses an 11-frame sliding window to analyze visual, motion, and coordinate features. An XGBoost classifier then scores the error probability for each frame.
The pipeline takes raw videos and annotations as input, extracts features, classifies potential errors, and categorizes them into actionable groups for human review.
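The sliding-window idea can be sketched for coordinate data alone. The example below compares the movement at one frame with the typical movement inside an 11-frame window centered on it. Only the window length comes from the article; the feature names, the use of the median, and the toy track are assumptions, and the visual features and the XGBoost classifier are deliberately left out.

```python
from math import hypot
from statistics import median

WINDOW = 11          # window length reported in the article
HALF = WINDOW // 2   # frames taken on each side of the current frame

def box_center(box):
    """Center of a box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def window_features(boxes, t):
    """Hypothetical coordinate features for frame t of a single track.

    The window is truncated at the start and end of the track.
    """
    lo = max(0, t - HALF)
    hi = min(len(boxes), t + HALF + 1)
    pts = [box_center(b) for b in boxes[lo:hi]]
    # Frame-to-frame center displacement inside the window.
    steps = [hypot(bx - ax, by - ay) for (ax, ay), (bx, by) in zip(pts, pts[1:])]
    if t == 0 or not steps:
        step_at_t = 0.0
    else:
        (ax, ay), (bx, by) = box_center(boxes[t - 1]), box_center(boxes[t])
        step_at_t = hypot(bx - ax, by - ay)
    typical = median(steps) if steps else 0.0
    ratio = step_at_t / typical if typical > 0 else 0.0
    return {"step": step_at_t, "typical_step": typical, "step_ratio": ratio}

def make_track(n=20, jump_at=10, jump=40):
    """Toy track: 2 px/frame to the right, with a sudden offset from jump_at on."""
    return [(2 * i + (jump if i >= jump_at else 0), 50,
             2 * i + 20 + (jump if i >= jump_at else 0), 90) for i in range(n)]

track = make_track()
features = [window_features(track, t) for t in range(len(track))]
print(features[5]["step_ratio"])   # 1.0
print(features[10]["step_ratio"])  # 21.0
```

Because the step at frame 10 is measured against its own neighborhood, the feature adapts to how fast the object normally moves. In a full pipeline, a feature vector like this would be one input to the per-frame classifier.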
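One plausible way to turn per-frame scores into something a reviewer can act on is to merge neighboring flagged frames into ranges. The article does not describe how Uber forms its groups, so the threshold, the gap tolerance, and the function name below are all hypothetical.

```python
def group_flagged_frames(scores, threshold=0.5, max_gap=1):
    """Merge frames with score >= threshold into (start, end, peak_score) ranges.

    Flagged frames separated by at most max_gap unflagged frames are merged,
    so a reviewer sees one range instead of several near-duplicate flags.
    """
    groups = []
    for frame, score in enumerate(scores):
        if score < threshold:
            continue
        if groups and frame - groups[-1][1] <= max_gap + 1:
            start, _, peak = groups[-1]
            groups[-1] = (start, frame, max(peak, score))
        else:
            groups.append((frame, frame, score))
    return groups

# Hypothetical per-frame error probabilities from a classifier.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7, 0.1, 0.1, 0.1, 0.6]
groups = group_flagged_frames(scores)
print(groups)  # [(2, 5, 0.9), (9, 9, 0.6)]
```

Sorting the resulting ranges by peak score would let operators look at the most likely errors first.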
Synthetic data for increased robustness
Because errors are rare in the real world, Uber generates synthetic data by introducing perturbations that mimic human mistakes. This includes simulating ID swaps and position jumps over various sizes and distances.
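The perturbation idea can be illustrated with two small helpers that corrupt clean tracks in a controlled way, so the location of every injected error is known and can serve as a training label. The data layout (a dict mapping track IDs to per-frame boxes of equal length) and the helper names are assumptions made for this sketch, not Uber’s implementation.

```python
def inject_position_jump(track, frame, dx, dy):
    """Return a copy of track with the box at `frame` shifted by (dx, dy)."""
    out = list(track)
    x0, y0, x1, y1 = out[frame]
    out[frame] = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    return out

def inject_id_swap(tracks, id_a, id_b, start):
    """Return a copy of tracks where id_a and id_b exchange boxes from `start` on."""
    out = {k: list(v) for k, v in tracks.items()}
    out[id_a][start:], out[id_b][start:] = tracks[id_b][start:], tracks[id_a][start:]
    return out

# Two clean toy tracks, six frames each, boxes as (x_min, y_min, x_max, y_max).
tracks = {
    "a": [(i, 0, i + 10, 10) for i in range(6)],
    "b": [(100 + i, 50, 110 + i, 60) for i in range(6)],
}

swapped = inject_id_swap(tracks, "a", "b", start=3)
print(swapped["a"][2])  # (2, 0, 12, 10)      still object a
print(swapped["a"][3])  # (103, 50, 113, 60)  now object b

# Sweep several jump sizes to cover both subtle and obvious errors.
jump_sizes = [5, 20, 80]
jumped = [inject_position_jump(tracks["a"], frame=2, dx=s, dy=0) for s in jump_sizes]
print([t[2] for t in jumped])  # [(7, 0, 17, 10), (22, 0, 32, 10), (82, 0, 92, 10)]
```

Both helpers copy their input, so the clean tracks remain available as negative examples alongside the perturbed ones.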
This synthetic dataset is derived from six open source datasets, which helps the system generalize across a variety of scenarios, from autonomous driving to crowded scenes. The focus throughout is on improving the quality of data labeling for machine learning.
In-tool validation and future planning
The system flags issues directly in uLabel, allowing the operator to fix the issue or reject the suggestion. Uber has already implemented this solution across its bounding box annotation project and plans to expand it to cover more error types and further improve the quality of video annotations.
This automatic validation significantly improves data quality, streamlines workflows, and contributes to more robust machine learning and robotics systems.