DepthP+P: Metric Accurate Monocular Depth Estimation using Planar and Parallax
Year: 2025, Volume: 6, Issue: 2, Pages: 20-29
Sadra Safadoust, Fatma Güney
Abstract
Current self-supervised monocular depth estimation methods mostly estimate a rigid-body transformation representing the camera motion, and their predictions suffer from the well-known scale ambiguity problem. We propose DepthP+P, a method that learns to produce outputs in metric scale by following the traditional planar parallax paradigm. We first align the two frames using a common ground plane, which removes the effect of the rotation component of the camera motion. Two neural networks then predict the depth and the camera translation; translation is easier to predict on its own than jointly with rotation. Assuming a known camera height, we can then calculate the induced 2D image motion of a 3D point and use it to reconstruct the target image in a self-supervised monocular approach. We perform experiments on the KITTI driving dataset and show that the planar parallax approach, which only needs to predict the camera translation, can be a metrically accurate alternative to current methods that rely on estimating 6DoF camera motion.
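To make the geometry concrete, the following is a brief sketch of the classic plane-plus-parallax relation that this formulation builds on; the notation ($p_w$, $e$, $\gamma$, $d_\pi$, $t_z$) follows the general plane-plus-parallax literature (e.g., Irani and Anandan) rather than the paper itself, and the exact parameterization used by DepthP+P may differ. After warping the source frame with the homography induced by the ground plane, a pixel at warped position $p_w$ exhibits (up to sign and normalization conventions) a residual parallax displacement

\[
\mu \;=\; \gamma \,\frac{t_z}{d_\pi}\,\bigl(p_w - e\bigr), \qquad \gamma = \frac{h}{Z},
\]

where $t_z$ is the forward component of the camera translation, $e$ is the epipole, $h$ is the height of the 3D point above the ground plane, $Z$ is its depth, and $d_\pi$ is the camera's distance to the plane. When the reference plane is the ground, $d_\pi$ is the known camera height; fixing it removes the global scale freedom, so the predicted translation and depth must be metric for the induced motion to reconstruct the target image correctly.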