A detailed look at the development process, personal goals, and technical approach behind this computer vision project
After talking to logistics managers, I realized how much money companies lose on incorrect package dimensions. Manual measurements are slow and error-prone, automated systems are expensive, and master data is often just wrong. This seemed like a perfect use case for computer vision - if I could make it actually work reliably.
Shipping and manufacturing companies lose millions annually due to incorrect package dimensions in their master data. Current solutions require expensive 3D scanners or manual measurement - but what if you could just take a photo with a standard reference object and get accurate measurements instantly?
I started by researching state-of-the-art models for each component. Rather than building from scratch, I chose to combine proven models: GroundingDINO for text-based object detection, SAM for precise segmentation, and Depth Anything V2 for monocular depth estimation.
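To make that concrete, here is a stripped-down sketch of how the three stages chain together. The `run_*` callables and the `PipelineOutput` container are my own placeholders standing in for the real model wrappers; their signatures are assumptions, not the models' actual APIs.

```python
# Minimal sketch of the three-stage pipeline. The run_* callables are
# placeholders for the real GroundingDINO / SAM / Depth Anything V2 wrappers.
from dataclasses import dataclass
import numpy as np


@dataclass
class PipelineOutput:
    label: str
    box_xyxy: np.ndarray   # pixel-space bounding box (x1, y1, x2, y2)
    mask: np.ndarray       # HxW boolean segmentation mask
    depth: np.ndarray      # HxW relative depth map


def measure(image: np.ndarray, prompt: str,
            run_detector, run_segmenter, run_depth) -> PipelineOutput:
    """Chain detection -> segmentation -> depth estimation on one image."""
    box = run_detector(image, prompt)    # GroundingDINO wrapper: text prompt -> box
    mask = run_segmenter(image, box)     # SAM wrapper: box prompt -> mask
    depth = run_depth(image)             # Depth Anything V2 wrapper: relative depth
    return PipelineOutput(prompt, box, mask, depth)
```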
The biggest challenge was making these models work together seamlessly. Each model expects different input formats and produces outputs in different coordinate systems. I spent significant time on coordinate transformations and ensuring data consistency between pipeline stages.
The core innovation was developing a robust calibration system using reference objects. I implemented confidence intervals, outlier detection, and perspective correction to make measurements as accurate as possible.
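In simplified form, the calibration works like this: measure the reference object's pixel extent, turn its known height into a cm-per-pixel scale at its depth, reject depth outliers inside its mask, and rescale each target by its depth ratio to the reference. The numpy sketch below is illustrative rather than the project's actual code, and it assumes the depth map grows with metric distance; a relative depth model does not guarantee that, so the map may need to be inverted or rescaled first.

```python
import numpy as np


def scale_from_reference(ref_mask: np.ndarray, ref_height_cm: float,
                         depth: np.ndarray) -> tuple[float, float]:
    """Derive a cm-per-pixel scale at the reference object's depth.

    Depth values inside the reference mask are cleaned with a simple
    median-absolute-deviation (MAD) filter before averaging.
    """
    ys, _ = np.nonzero(ref_mask)
    pixel_height = ys.max() - ys.min() + 1
    cm_per_px = ref_height_cm / pixel_height

    d = depth[ref_mask]
    med = np.median(d)
    mad = np.median(np.abs(d - med)) + 1e-6
    ref_depth = float(d[np.abs(d - med) < 3 * mad].mean())  # outliers dropped
    return cm_per_px, ref_depth


def target_size_cm(target_mask: np.ndarray, depth: np.ndarray,
                   cm_per_px: float, ref_depth: float) -> tuple[float, float]:
    """Scale the target's pixel extent by its depth ratio to the reference."""
    ys, xs = np.nonzero(target_mask)
    h_px = ys.max() - ys.min() + 1
    w_px = xs.max() - xs.min() + 1
    t_depth = float(np.median(depth[target_mask]))
    scale = cm_per_px * (t_depth / ref_depth)  # farther away -> more cm per pixel
    return h_px * scale, w_px * scale
```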
The trickiest single piece was the coordinate transformation between GroundingDINO's bounding boxes and SAM's segmentation masks, and then mapping both onto the depth map's coordinate system while preserving spatial accuracy.
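Concretely, GroundingDINO-style boxes come out as normalized (cx, cy, w, h), SAM expects absolute-pixel (x1, y1, x2, y2) box prompts, and the resulting mask then has to be resampled to the depth map's resolution. A minimal sketch of those two conversions (the helper names are mine, for illustration only):

```python
import numpy as np
import cv2


def dino_box_to_xyxy(box_cxcywh_norm, img_w: int, img_h: int) -> np.ndarray:
    """Convert a GroundingDINO-style normalized (cx, cy, w, h) box into the
    absolute-pixel (x1, y1, x2, y2) format that SAM takes as a box prompt."""
    cx, cy, w, h = box_cxcywh_norm
    return np.array([(cx - w / 2) * img_w, (cy - h / 2) * img_h,
                     (cx + w / 2) * img_w, (cy + h / 2) * img_h])


def align_mask_to_depth(mask: np.ndarray, depth_shape: tuple) -> np.ndarray:
    """Resample a SAM mask to the depth map's resolution. Nearest-neighbour
    interpolation keeps the mask binary and its boundaries crisp."""
    h, w = depth_shape
    resized = cv2.resize(mask.astype(np.uint8), (w, h),
                         interpolation=cv2.INTER_NEAREST)
    return resized.astype(bool)
```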
Example 1: Prompt: "laptop" | Reference: coffee mug (8 cm height)

[Figures: Depth Anything V2 monocular depth estimation, SAM segmentation masks, and the combined final analysis visualization]
Final result: laptop dimensions estimated as 35.2 cm × 23.8 cm (reported uncertainty ±30%). Actual dimensions: 30.5 cm × 21.5 cm → error of roughly 15%, within the expected range.
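For the record, the ~15% figure is simply the worse of the two per-dimension relative errors:

```python
# Reproducing the quoted error from the numbers above (worst case per dimension).
estimated = (35.2, 23.8)   # cm, pipeline output
actual = (30.5, 21.5)      # cm, tape-measure ground truth
errors = [abs(e - a) / a for e, a in zip(estimated, actual)]
print([f"{err:.1%}" for err in errors])   # ['15.4%', '10.7%']
```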
Example 2: Coming soon with street scene data
Prompt: "car" | Reference: street lamp (3 m height)
Multiple cars at different depths
Varying measurement accuracy by distance
This scenario tests the system's ability to handle multiple objects at different depths: the closer car was measured to within about 25%, while distant cars carried roughly 40% uncertainty due to the depth map's limited resolution.
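One way to see why distance hurts, assuming a simple pinhole camera model and made-up pixel widths: the same one-pixel segmentation error eats a much larger fraction of a distant object's extent.

```python
# Under a pinhole model, real size = pixel size * depth / focal length, so a
# fixed +/-1 px segmentation error costs far more on a distant object that
# only spans a few pixels. Pixel widths below are illustrative, not measured.
def relative_size_error(object_width_px: float, px_error: float = 1.0) -> float:
    return px_error / object_width_px


near_car_px, far_car_px = 500, 50
print(f"near car: {relative_size_error(near_car_px):.1%}")  # 0.2%
print(f"far car:  {relative_size_error(far_car_px):.1%}")   # 2.0%
# On top of this, the relative depth estimate itself gets coarser with distance.
```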
Each model outputs a different coordinate system. I spent three days debugging why bounding boxes were offset by exactly 50 pixels.
Monocular depth has no absolute scale. A laptop could be 30cm or 3m away - you literally can't tell without a reference.
Laptop screens, glass surfaces, and shiny objects completely break depth estimation. Models weren't trained for this.
One bad detection early in the pipeline ruins everything downstream. Error handling is more important than accuracy.
A coffee mug works better than a credit card. Circular objects at a similar depth to the target work best.
Images with the object dead center work great. Near the frame edges, at odd angles, or in poor lighting, the system falls apart fast.
State-of-the-art models ≠ production-ready system. The hardest part wasn't the AI - it was making three different models work together reliably, handling failures gracefully, and getting consistent results on real photos. Integration and error handling took 80% of the development time.
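In practice, "handling failures gracefully" meant validating every stage's output before the next stage touched it. A minimal sketch of the idea (the class and helper names are illustrative, and the threshold values here are purely for demonstration):

```python
# Sketch of "fail loudly, fail early": validate each stage's output before the
# next stage consumes it, instead of letting one bad detection propagate.
import numpy as np


class PipelineError(Exception):
    """Raised when a stage's output is too unreliable to continue."""


def check_detection(score: float, min_score: float = 0.35) -> None:
    if score < min_score:
        raise PipelineError(f"detection confidence {score:.2f} < {min_score}")


def check_mask(mask: np.ndarray, image_shape: tuple,
               min_frac: float = 0.001, max_frac: float = 0.9) -> None:
    frac = mask.sum() / (image_shape[0] * image_shape[1])
    if not (min_frac <= frac <= max_frac):
        raise PipelineError(f"mask covers {frac:.1%} of the image - likely the wrong object")
```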
Use multiple reference objects to improve accuracy and handle scale variations across the scene.
Optimize the pipeline for video streams and mobile deployment using model quantization.
Integrate with AR frameworks for live measurement overlay on camera feeds.