Funding: Supported in part by School Research Projects of Wuyi University (No. 5041700175).
Abstract: Monocular depth estimation is a fundamental task in computer vision. Its accuracy has improved tremendously over the past decade with the development of deep learning. However, blurry boundaries in the predicted depth maps remain a serious problem. Researchers have found that boundary blur is mainly caused by two factors. First, low-level features, which carry boundary and structure information, may be lost in deep networks during the convolution process. Second, the model ignores errors introduced in the boundary area during backpropagation, because the boundary occupies only a small portion of the whole image. Focusing on these factors, we propose two countermeasures to mitigate the boundary blur problem. First, we design a scene understanding module and a scale transform module to build a lightweight fused feature pyramid, which effectively handles the loss of low-level features. Second, we propose a boundary-aware depth loss function that emphasizes the depth values near boundaries. Extensive experiments show that our method predicts depth maps with clearer boundaries, and its depth accuracy on NYU-Depth V2, SUN RGB-D, and iBims-1 is competitive.
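The abstract does not give the exact form of the boundary-aware loss; a minimal sketch of the general idea, up-weighting an L1 depth loss near ground-truth depth edges found with a Sobel filter, might look as follows (the function names, the Sobel-based edge detector, and the weighting scheme are assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def sobel_edges(depth):
    """Approximate depth-gradient magnitude with Sobel kernels; depth is (N, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=depth.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(depth, kx, padding=1)
    gy = F.conv2d(depth, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def boundary_aware_l1(pred, gt, alpha=4.0):
    """L1 depth loss with extra weight on pixels near ground-truth depth boundaries."""
    weight = 1.0 + alpha * sobel_edges(gt)          # larger weight at boundaries
    return (weight * (pred - gt).abs()).mean()
```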
Funding: This work was supported by the National Key Research and Development Plan (Project No. YS2018YFB1403703) and a research project of the Communication University of China (Project No. CUC200D058).
Abstract: Learning-based multi-task models have been widely used in various scene understanding tasks, where the tasks complement each other; for example, prior semantic information can be exploited to better infer depth. We boost unsupervised monocular depth estimation by using semantic segmentation as an auxiliary task. To address the lack of cross-domain datasets and the catastrophic forgetting problem encountered in multi-task training, we utilize an existing methodology to obtain redundant segmentation maps and build our cross-domain dataset, which not only provides a new way to conduct multi-task training but also allows us to compare our results with those of other algorithms. In addition, to make comprehensive use of the features extracted by the two tasks in the early perception stage, we share weights in the network to fuse cross-domain features, and we introduce a novel multi-task loss function to further smooth the depth values. Extensive experiments on the KITTI and Cityscapes datasets show that our method achieves state-of-the-art performance in depth estimation as well as improved semantic segmentation.
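The abstract does not specify the multi-task smoothing term; one plausible form is a semantics-guided smoothness penalty that encourages locally smooth disparity except across predicted segment boundaries. The sketch below is an illustration under that assumption only:

```python
import torch

def semantic_guided_smoothness(disp, seg_logits):
    """Penalize disparity changes between neighboring pixels that share a predicted
    semantic label; disp is (N, 1, H, W), seg_logits is (N, C, H, W)."""
    seg = seg_logits.argmax(dim=1, keepdim=True).float()       # (N, 1, H, W)
    d_dx = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    d_dy = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    same_x = (seg[:, :, :, 1:] == seg[:, :, :, :-1]).float()   # 1 inside a segment
    same_y = (seg[:, :, 1:, :] == seg[:, :, :-1, :]).float()
    return (d_dx * same_x).mean() + (d_dy * same_y).mean()
```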
Funding: Supported by the National Natural Science Foundation of China under Grants 61872241 and 62077037, and the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102.
Abstract: Background: Monocular depth estimation aims to predict a dense depth map from a single RGB image and has important applications in 3D reconstruction, autonomous driving, and augmented reality. However, existing methods feed the original RGB image directly into the model to extract depth features without suppressing depth-irrelevant information, which degrades depth-estimation accuracy and leads to inferior performance. Methods: To remove the influence of depth-irrelevant information and improve depth-prediction accuracy, we propose RADepthNet, a novel reflectance-guided network that fuses boundary features. Specifically, our method predicts depth maps in three steps. (1) Intrinsic image decomposition: we propose a reflectance extraction module with an encoder-decoder structure to extract the depth-related reflectance; an ablation study demonstrates that the module reduces the influence of illumination on depth estimation. (2) Boundary detection: a boundary extraction module, consisting of an encoder, a refinement block, and an upsampling block, is proposed to better predict depth at object boundaries using gradient constraints. (3) Depth prediction: we use an encoder different from that in (2) to obtain depth features from the reflectance map and fuse them with the boundary features to predict depth. In addition, we propose FIFADataset, a depth-estimation dataset for soccer scenarios. Results: Extensive experiments on a public dataset and our proposed FIFADataset show that our method achieves state-of-the-art performance.
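As an illustration of step (3), fusing depth features from the reflectance encoder with boundary features could be as simple as channel concatenation followed by a convolution. The block below is a sketch under that assumption; the class name, channel sizes, and layer choices are not from the paper:

```python
import torch
import torch.nn as nn

class BoundaryFusionBlock(nn.Module):
    """Illustrative fusion of reflectance-derived depth features with boundary features."""
    def __init__(self, depth_ch=256, boundary_ch=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(depth_ch + boundary_ch, depth_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(depth_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, depth_feat, boundary_feat):
        # Both inputs share the same spatial resolution; fuse along the channel axis.
        return self.fuse(torch.cat([depth_feat, boundary_feat], dim=1))
```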
Abstract: This paper addresses the problem of supervised monocular depth estimation. We start with a meticulous pilot study demonstrating that long-range correlation is essential for accurate depth estimation and that the Transformer and convolution are good at long-range and close-range depth estimation, respectively. Therefore, we propose a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former models global context with an effective attention mechanism, while the latter preserves local information, since the Transformer lacks the spatial inductive bias needed to model such content. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module to enhance the Transformer features and model the affinity between the heterogeneous features in a set-to-set translation manner. Because global attention on high-resolution feature maps incurs an unbearable memory cost, we adopt a deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. The effectiveness of each proposed module is evaluated through meticulous and intensive ablation studies.
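A toy sketch of the parallel-encoder idea is given below: a Transformer branch for global context runs alongside a convolutional branch for local detail, and their features are combined. This is not DepthFormer itself (which uses hierarchical backbones and a deformable-attention interaction module); the layer sizes, naive additive fusion, and the assumption that input height and width are divisible by the patch size are illustrative choices:

```python
import torch
import torch.nn as nn

class ParallelEncoder(nn.Module):
    """Toy parallel Transformer + convolution encoder (illustration only)."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)
        self.conv_branch = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=patch // 2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        n, _, h, w = x.shape                                      # h, w divisible by patch
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (N, L, dim)
        global_feat = self.transformer(tokens)                    # long-range context
        local_feat = self.conv_branch(x)                          # close-range detail
        gh, gw = h // self.patch, w // self.patch
        global_map = global_feat.transpose(1, 2).reshape(n, -1, gh, gw)
        return global_map + local_feat                            # naive fusion
```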
Funding: Supported by the Australia-India Strategic Research Fund (Project AISRF53820).
Abstract: Background: A lack of depth perception in medical imaging systems is one of the long-standing technological limitations of minimally invasive surgery. The ability to visualize anatomical structures in 3D can improve conventional arthroscopic surgery, as a full 3D semantic representation of the surgical site directly enhances surgeons' abilities. It also opens the possibility of intraoperative image registration with preoperative clinical records for the development of semi-autonomous and fully autonomous platforms. This study presents a novel monocular depth prediction model that infers depth maps from a single-color arthroscopic video frame. Methods: We applied a novel technique that combines supervised and self-supervised loss terms, thereby eliminating the drawbacks of each, and enables the estimation of edge-preserving depth maps from a single untextured arthroscopic frame. The proposed image acquisition technique projects artificial textures onto the surface to improve the quality of disparity maps obtained from stereo images. Moreover, by integrating attention-aware multi-scale feature extraction with scene-global contextual constraints and multi-scale depth fusion, the model predicts reliable and accurate tissue depth of the surgical sites that complies with scene geometry. Results: A total of 4,128 stereo frames from a knee phantom were used to train the network; during the pre-training stage, the network learned disparity maps from the stereo images. The fine-tuning phase used 12,695 knee arthroscopic stereo frames from cadaver experiments along with their corresponding coarse disparity maps obtained by stereo matching. In a supervised fashion, the network learns the transformation from the left image to the disparity map, whereas the self-supervised loss term refines the coarse depth map by minimizing reprojection, gradient, and structural dissimilarity losses. Together, our method produces high-quality 3D maps with minimal reprojection loss: 0.0004132 (structural similarity index), 0.00036120156 (L1 error distance), and 6.591908×10^(−5) (L1 gradient error distance). Conclusion: Machine learning techniques for monocular depth prediction are studied to infer accurate depth maps from a single-color arthroscopic video frame. Moreover, the study integrates a segmentation model, so 3D segmented maps are also inferred, providing extended perception ability and tissue awareness.
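The hybrid objective described above, a supervised term against the coarse stereo-matched disparity plus self-supervised reprojection and gradient terms, could be sketched as below. The weights, the plain L1 stand-in for the SSIM-based reprojection term, and the argument names are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def gradient_l1(pred, target):
    """L1 distance between horizontal/vertical gradients (edge-preservation term)."""
    dx_p = pred[..., :, 1:] - pred[..., :, :-1]
    dy_p = pred[..., 1:, :] - pred[..., :-1, :]
    dx_t = target[..., :, 1:] - target[..., :, :-1]
    dy_t = target[..., 1:, :] - target[..., :-1, :]
    return (dx_p - dx_t).abs().mean() + (dy_p - dy_t).abs().mean()

def hybrid_loss(pred_disp, coarse_disp, left_img, right_warped,
                w_sup=1.0, w_photo=1.0, w_grad=0.5):
    """Supervised L1 against coarse disparity + self-supervised reprojection + gradients."""
    supervised = F.l1_loss(pred_disp, coarse_disp)
    photometric = F.l1_loss(right_warped, left_img)   # stand-in for SSIM + L1 reprojection
    return (w_sup * supervised + w_photo * photometric
            + w_grad * gradient_l1(pred_disp, coarse_disp))
```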
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61972298, the CAAI-Huawei MindSpore Open Fund, and the Xinjiang Bingtuan Science and Technology Program of China under Grant No. 2019BC008.
Abstract: Based on well-designed network architectures and objective functions, self-supervised monocular depth estimation has made great progress. However, lacking a specific mechanism to make the network learn more about regions containing moving objects or occlusions, existing depth estimation methods are likely to produce poor results there. We therefore propose an uncertainty quantification method that improves the performance of existing depth estimation networks without changing their architectures. Our method consists of uncertainty measurement, learning guidance by uncertainty, and adaptive final determination. First, using Snapshot and Siam learning strategies, we measure the degree of uncertainty by calculating the variance across pre-converged epochs or twin networks during training. Second, we use the uncertainty to guide the network to strengthen learning in regions with higher uncertainty. Finally, we use the uncertainty to adaptively produce the final depth estimation results, balancing accuracy and robustness. To demonstrate the effectiveness of our uncertainty quantification method, we apply it to two state-of-the-art models, Monodepth2 and Hints. Experimental results show that our method improves depth estimation performance on seven evaluation metrics compared with the two baseline models and exceeds the existing uncertainty method.
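A minimal sketch of the variance-based uncertainty measurement and a possible uncertainty-guided weighting is shown below; the tensor layout, the normalization of the weight, and the function names are assumptions made for illustration:

```python
import torch

def snapshot_uncertainty(depth_snapshots):
    """Per-pixel uncertainty as the variance of depth predictions collected from
    several pre-converged training epochs; depth_snapshots is (K, N, 1, H, W)."""
    return depth_snapshots.var(dim=0, unbiased=False)

def uncertainty_weighted_loss(per_pixel_loss, uncertainty, beta=1.0):
    """Up-weight the training loss in regions with higher uncertainty."""
    weight = 1.0 + beta * (uncertainty / (uncertainty.max() + 1e-8))
    return (weight * per_pixel_loss).mean()
```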
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2018YFC0809302), the National Natural Science Foundation of China (Grant Nos. 61988101, 61751305, and 61673176), the Fundamental Research Funds for the Central Universities (Grant No. JKH012016011), and the Programme of Introducing Talents of Discipline to Universities (the "111" Project) (Grant No. B17017).
Abstract: Depth information is important for autonomous systems to perceive environments and estimate their own state. Traditional depth estimation methods, such as structure from motion and stereo vision matching, are built on feature correspondences across multiple viewpoints, and the depth maps they predict are sparse. Inferring depth from a single image (monocular depth estimation) is an ill-posed problem. With the rapid development of deep neural networks, monocular depth estimation based on deep learning has been widely studied in recent years and has achieved promising accuracy; dense depth maps are estimated from single images by deep neural networks in an end-to-end manner. To improve the accuracy of depth estimation, various network frameworks, loss functions, and training strategies have been proposed. In this review, we survey current deep learning-based monocular depth estimation methods. We first summarize several widely used datasets and evaluation indicators. We then review representative existing methods according to their training manner: supervised, unsupervised, and semi-supervised. Finally, we discuss the challenges and provide ideas for future research in monocular depth estimation.
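The evaluation indicators mentioned in such surveys are largely standardized (absolute/squared relative error, RMSE, log RMSE, and threshold accuracies δ < 1.25^k). A routine implementation, not taken from the survey itself, is:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Common monocular depth evaluation indicators over valid ground-truth pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    thresh = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel":  np.mean(np.abs(pred - gt) / gt),
        "sq_rel":   np.mean((pred - gt) ** 2 / gt),
        "rmse":     np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "delta_1":  np.mean(thresh < 1.25),
        "delta_2":  np.mean(thresh < 1.25 ** 2),
        "delta_3":  np.mean(thresh < 1.25 ** 3),
    }
```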
Funding: Partially supported by the Key Technological Innovation Projects of Hubei Province (2018AAA062), the National Natural Science Foundation of China (61972298), and the Wuhan University-Huawei GeoInformatics Innovation Lab.
Abstract: Self-supervised monocular depth estimation has been widely investigated and applied in previous works. However, existing methods suffer from texture copy, depth drift, and incomplete structure. It is difficult for ordinary CNNs to fully understand the relationship between an object and its surrounding environment, and it is hard to design a depth smoothness loss that balances smoothness and sharpness. To address these issues, we propose a coarse-to-fine method with a normalized convolutional block attention module (NCBAM). In the coarse estimation stage, we incorporate the NCBAM into the depth and pose networks to overcome the texture-copy and depth-drift problems. In the refinement stage, a new network refines the coarse depth under the guidance of the color image and produces a structure-preserving depth result. Our method produces results competitive with state-of-the-art methods, and comprehensive experiments prove the effectiveness of our two-stage method using the NCBAM.
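The NCBAM builds on CBAM-style channel and spatial attention; the abstract does not describe how the module is "normalized", so the sketch below shows only the plain channel-plus-spatial attention pattern, with layer sizes chosen for illustration:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel and spatial attention (the normalization of NCBAM is omitted)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                          # channel descriptors
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)            # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))                   # spatial attention
```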
Funding: Supported in part by a fund from Bentley Systems, Inc.
Abstract: Recent advances in computer vision and deep learning have shown that fusing depth information can significantly enhance the performance of RGB-based damage detection and segmentation models. However, alongside these advantages, depth sensing also presents many practical challenges. For instance, depth sensors impose an additional payload on robotic inspection platforms, limiting operation time and increasing inspection cost. Additionally, some lidar-based depth sensors perform poorly outdoors due to sunlight contamination during the daytime. In this context, this study investigates the feasibility of abolishing depth sensing at test time without compromising segmentation performance. An autonomous damage segmentation framework is developed based on recent advances in vision-based multi-modal sensing, namely modality hallucination (MH) and monocular depth estimation (MDE), which require depth data only during model training. At deployment time, depth data becomes expendable because it can be simulated from the corresponding RGB frames, making it possible to reap the benefits of depth fusion without any depth perception per se. The study explores two depth encoding techniques and three fusion strategies in addition to a baseline RGB-based model. The proposed approach is validated on computer-generated RGB-D data of reinforced concrete buildings subjected to seismic damage. The surrogate techniques increase segmentation IoU by up to 20.1% with a negligible increase in computation cost. Overall, this study is believed to make a positive contribution to enhancing the resilience of critical civil infrastructure.
Abstract: Existing depth completion methods are often targeted at a specific sparse depth type and generalize poorly across task domains. We present a method to complete sparse or semi-dense, noisy, and potentially low-resolution depth maps obtained by various range sensors, including those in modern mobile phones, or by multi-view reconstruction algorithms. Our method leverages a data-driven prior in the form of a single-image depth prediction network trained on large-scale datasets, whose output is used as an additional input to our model. We propose an effective training scheme in which we simulate various sparsity patterns typical of different task domains. In addition, we design two new benchmarks to evaluate the generalizability and robustness of depth completion methods. Our simple method shows superior cross-domain generalization against state-of-the-art depth completion methods, introducing a practical solution for high-quality depth capture on a mobile device.
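One way to simulate a sparsity pattern during training, as the training scheme above describes in general terms, is to randomly subsample and perturb a dense ground-truth depth map. The sampling pattern, noise model, and parameters below are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

def simulate_sparse_depth(dense_depth, keep_ratio=0.05, noise_std=0.01):
    """Simulate a sparse, noisy depth map (H, W) from a dense one for training
    depth completion; zeros mark missing depth."""
    h, w = dense_depth.shape
    mask = np.random.rand(h, w) < keep_ratio                  # random sparse sampling
    noisy = dense_depth + np.random.randn(h, w) * noise_std * dense_depth
    sparse = np.where(mask, noisy, 0.0).astype(np.float32)
    return sparse, mask
```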
Funding: Project supported by the Key R&D Program of Guangdong Province, China (No. 2019B01015000) and the National Natural Science Foundation of China (No. 61902201).
Abstract: Self-supervised depth estimation approaches achieve results comparable to those of fully supervised approaches by employing view synthesis between the target and reference images in the training data. However, ResNet, which often serves as the backbone network, has structural deficiencies when applied to such downstream tasks because it was originally designed for classification, and low-texture areas further degrade performance. To address these problems, we propose a set of improvements that lead to superior predictions. First, we boost the information flow in the network and improve its ability to learn spatial structures by improving the network architecture. Second, we use a binary mask to remove pixels in low-texture areas between the target and reference images so that the image can be reconstructed more accurately. Finally, we input the target and reference images in random order to expand the dataset, and we pre-train on ImageNet so that the model obtains a favorable general feature representation. We demonstrate state-of-the-art performance on the Eigen split of the KITTI driving dataset using stereo pairs.
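The binary mask for low-texture regions is not specified in the abstract; one simple realization is to keep only pixels whose local image gradient exceeds a threshold, as sketched below (the gradient criterion and threshold value are assumptions):

```python
import torch
import torch.nn.functional as F

def low_texture_mask(img, threshold=0.02):
    """Binary mask (N, 1, H, W) that keeps textured pixels and discards low-texture ones;
    img is an RGB batch (N, 3, H, W) with values in [0, 1]."""
    gray = img.mean(dim=1, keepdim=True)
    dx = (gray[:, :, :, 1:] - gray[:, :, :, :-1]).abs()
    dy = (gray[:, :, 1:, :] - gray[:, :, :-1, :]).abs()
    grad = F.pad(dx, (0, 1, 0, 0)) + F.pad(dy, (0, 0, 0, 1))   # pad back to H x W
    return (grad > threshold).float()
```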