Learning to Detect Mirrors from Videos via Dual Correspondences

1 City University of Hong Kong 2 East China Normal University
*Joint first authors
CVPR 2023

Although the state-of-the-art single-image mirror detection method VCNet performs well on a single image (e.g., the first row) by implicitly using intra-frame correspondence, it may fail when the intra-frame cue is weak or absent in some video frames (e.g., the second and third rows). Because current mirror detection methods do not exploit inter-frame information, they produce inaccurate and inconsistent results when applied to video mirror detection (VMD). In contrast, our method performs well in both situations by using the proposed dual correspondence module to exploit intra-frame (spatial) and inter-frame (temporal) correspondences.


Detecting mirrors from static images has received significant research interest recently. However, detecting mirrors in dynamic scenes remains under-explored due to the lack of a high-quality dataset and an effective method for video mirror detection (VMD). To the best of our knowledge, this is the first work to address the VMD problem from a deep-learning-based perspective. Our observation is that there are often correspondences between the contents inside (reflected) and outside (real) of a mirror, but such correspondences may not appear in every frame, e.g., due to changes in camera pose. This inspires us to propose a video mirror detection method, named VMD-Net, that can tolerate spatially missing correspondences by considering mirror correspondences at both the intra-frame and inter-frame levels. It does so via a dual correspondence module that looks over multiple frames spatially and temporally to correlate correspondences. We further propose the first large-scale dataset for VMD (named VMD-D), which contains 14,987 image frames from 269 videos with corresponding manually annotated masks. Experimental results show that the proposed method outperforms SOTA methods from relevant fields. To enable real-time VMD, our method uses the backbone features efficiently by removing the redundant multi-level module design and eliminates the post-processing of output maps commonly used in existing methods, making it efficient and practical for real-time video-based applications.
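To make the idea of intra-frame and inter-frame correspondence concrete, the sketch below relates the features of one frame to themselves (spatial) and to the features of a second frame (temporal). It is only an illustration and not the VMD-Net implementation: plain scaled dot-product attention stands in for the dual correspondence module, each frame is assumed to be a small list of feature vectors, the two results are fused by simple summation, and the names `attend` and `dual_correspondence` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys):
    """For each query vector, return a softmax-weighted average of the keys.

    This is scaled dot-product attention with the keys reused as values.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return attended


def dual_correspondence(frame_a, frame_b):
    """Fuse intra-frame and inter-frame correspondences for frame_a.

    Assumption: fusion is a plain element-wise sum with the input features;
    the source does not specify how the two cues are combined.
    """
    intra = attend(frame_a, frame_a)  # spatial: positions within one frame
    inter = attend(frame_a, frame_b)  # temporal: positions in another frame
    return [
        [x + i + t for x, i, t in zip(feat, intra_vec, inter_vec)]
        for feat, intra_vec, inter_vec in zip(frame_a, intra, inter)
    ]


# Toy usage: two frames, each with two 2-dimensional feature vectors.
frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_b = [[1.0, 0.0], [1.0, 0.0]]
fused = dual_correspondence(frame_a, frame_b)
print(fused)
```

In this toy setting, a position in `frame_a` with no good match inside its own frame can still pick up a matching feature from `frame_b`, which mirrors the motivation for looking across frames when the intra-frame cue is weak or absent.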


      author    = {Lin, Jiaying and Tan, Xin and Lau, Rynson W.H.},
      title     = {Learning To Detect Mirrors From Videos via Dual Correspondences},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month     = {June},
      year      = {2023},
      pages     = {9109-9118}