Lingli Zhu
Dr. Lingli Zhu received her doctoral degree in remote sensing from the School of Engineering at Aalto University in 2015. She earned a master's degree in photogrammetry from Helsinki University of Technology (now Aalto University) in 2007. She currently works at the National Land Survey of Finland. She has published more than 30 SCI-indexed papers as first author and three book chapters, one of which has been downloaded more than 3,000 times. She has served as a guest editor for special issues of several well-known journals, including IJGI (2018, 2021, 2022), Remote Sensing (2019, 2020), and the International Journal of Applied Earth Observation and Geoinformation (JAG) (2021, 2022). She has reviewed more than 40 papers for Remote Sensing, Sensors, IJGI, and other journals.
Sessions
Manual digitization of 3D information from aerial stereo images has long been one of the major tasks of national mapping agencies. It is, however, labor-intensive, and there is a pressing need for automatic methods of extracting 3D information from stereo images. Recent advances in hardware and software make full automation of stereo-image tasks feasible. These tasks demand substantial computational power, and the emergence of the GPU has greatly supported the development of such techniques. With recent advances in AI, and especially in deep learning, machines are gaining the ability to learn, improve, and execute repetitive tasks precisely by combining and adjusting the millions or even billions of parameters of a neural network. As a result, it is becoming possible to automate many complex tasks.
OpenCV (Open Source Computer Vision Library), first released in 2000, is an open-source library that includes several hundred computer vision algorithms. It supports epipolar geometry estimation and epipolar constraints as well as depth calculation from stereo images. Before 2016, many researchers employed OpenCV for depth estimation from stereo images. In recent years, deep learning methods for obtaining depth maps from stereo images have received growing attention. In deep learning applications, the left and right images usually need to be rectified before they can be fed to the network. GC-Net, HRS Net, MVSNet, PSMNet, and PLUMENet are examples of convolutional neural networks (CNNs) that can be used for this purpose. GC-Net was introduced in 2017 by Kendall et al. [1], PSMNet in 2018 by Chang et al. [2], MVSNet in 2018 by Yao et al. [3], HRS Net in 2019 by Yang et al. [4], and PLUMENet in 2021 by Wang et al. Disparity images are typically used as labels, but some networks are trained with unsupervised learning, meaning that no labels are used. Many experiments have been based on open datasets, with KITTI stereo and Middlebury stereo being good examples. Ready-made remote sensing stereo image datasets still seem to be quite scarce, but some can be found, for example the Vaihingen stereo image dataset of the Aerial Stereo Dense Matching Benchmark introduced in 2021 [5].
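The relation between a disparity map and depth for a rectified pair can be illustrated with a minimal sketch. It assumes the standard parallel-axis stereo model, Z = f·B/d; the function name and the parameters `focal_px`, `baseline_m`, and `min_disp` are hypothetical and are not taken from the experiment described here.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map (pixels) to depth (metres) for a rectified pair.

    Assumes the parallel-axis stereo model Z = f * B / d. Pixels whose
    disparity is not above min_disp are marked invalid (NaN).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > min_disp
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Toy values chosen only for illustration.
disp = np.array([[10.0, 20.0],
                 [0.0, 40.0]])
depth = disparity_to_depth(disp, focal_px=1000.0, baseline_m=0.5)
print(depth)
```

Marking zero disparities as NaN rather than infinity keeps later statistics (means, error rates) easy to mask.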
Our experiment focused on obtaining disparity maps (i) from aerial stereo images with known orientation parameters using OpenCV, and (ii) from rectified aerial stereo images using three deep neural networks: GC-Net, MVSNet, and PSMNet. The results obtained with OpenCV and with the neural networks were compared and evaluated.
Two datasets were used in the experiment. The first consisted of aerial stereo images with known orientation parameters from the National Land Survey of Finland. The images were acquired in 2020 with an UltraCam Eagle Mark 3 (Vexcel, Austria), with a forward overlap of 80% and a side overlap of 30% between flight strips. The flight height was 7657.9 m, and the images have a spatial resolution of 30 cm. The second was the Vaihingen dataset from the ISPRS Aerial Stereo Dense Matching Benchmark 2021 [5]. The Vaihingen dataset, which originates from the ISPRS 3D reconstruction benchmark, provides well-registered oriented images and LiDAR point clouds. It comprises 20 images with a radiometric depth of 11 bits and a ground sample distance (GSD) of 8 cm. The reference depth maps used for evaluation were produced from the LiDAR point clouds.
In the OpenCV experiment, the known orientation parameters were used for image rectification, and ORB (Oriented FAST and Rotated BRIEF) features were used to find matching points between the images. ORB is an open-source, efficient alternative to SIFT and SURF. The algorithm uses FAST on image pyramids to detect stable keypoints, selects the strongest features using the FAST or Harris response, finds their orientation using first-order moments, and computes the descriptors using BRIEF, where the coordinates of the random point pairs (or k-tuples) are rotated according to the measured orientation.
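The orientation step mentioned above (first-order moments) can be sketched in a few lines of NumPy. This is an illustrative simplification, not OpenCV's implementation: it uses a square patch instead of a circular one, and the function name `patch_orientation` is hypothetical.

```python
import numpy as np

def patch_orientation(patch):
    """Angle (radians) from the patch centre to its intensity centroid.

    Uses the first-order moments m10 = sum(x * I) and m01 = sum(y * I),
    with x and y measured from the patch centre (y points down the rows).
    """
    patch = np.asarray(patch, dtype=np.float64)
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0
    ys = ys - (h - 1) / 2.0
    m10 = (xs * patch).sum()
    m01 = (ys * patch).sum()
    return np.arctan2(m01, m10)

# A patch that is brighter on its right side: the centroid lies along +x.
patch = np.zeros((7, 7))
patch[:, 4:] = 1.0
angle = patch_orientation(patch)

# Transposing makes the bottom rows bright: the centroid lies along +y.
angle_rotated = patch_orientation(patch.T)
print(angle, angle_rotated)
```

Rotating the BRIEF sampling pattern by this angle is what makes the descriptor approximately invariant to in-plane rotation.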
In the deep learning experiment, GC-Net, MVSNet, and PSMNet were tested. GC-Net is an end-to-end deep stereo regression architecture [1]. It estimates per-pixel disparity from a single rectified image pair by employing a cost volume to reason about geometry and a deep convolutional network formulation to reason about semantics. MVSNet is an end-to-end deep learning architecture for depth map inference from multi-view images [3]. It computes one depth map at a time by extracting deep visual image features, building a 3D cost volume on the reference camera frustum, and applying 3D convolutions to regularize and regress the initial depth map, which is then refined to generate the final output. PSMNet is a pyramid stereo matching network consisting of two main modules: spatial pyramid pooling and a 3D CNN [2]. It exploits global context information in stereo matching. Through its pyramid pooling module, PSMNet extends pixel-level features to region-level features with receptive fields of different scales. The cost volume is formed by combining global and local feature cues. A stacked hourglass 3D CNN repeatedly processes the cost volume in a top-down/bottom-up manner to improve the use of global context information.
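The cost-volume idea shared by these networks can be illustrated with a toy NumPy sketch. It is not any of the networks above: the matching cost here is a plain absolute difference of raw intensities rather than learned features, there is no 3D convolution, and the function names and the `temperature` parameter are hypothetical. The final step is a soft argmin, that is, the expected disparity under a softmax over negated costs, which is the kind of differentiable regression used in GC-Net [1].

```python
import numpy as np

def build_cost_volume(left, right, max_disp):
    """Absolute-difference cost volume of shape (max_disp, H, W).

    Left pixel x is compared with right pixel x - d. Positions without a
    counterpart in the right image keep a large constant cost.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    cost = np.full((max_disp, h, w), 1e3)
    for d in range(max_disp):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :w - d])
    return cost

def soft_argmin(cost, temperature=1.0):
    """Expected disparity under softmax(-cost / temperature) along axis 0."""
    logits = -cost / temperature
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    disp_values = np.arange(cost.shape[0]).reshape(-1, 1, 1)
    return (prob * disp_values).sum(axis=0)

# Synthetic rectified pair: the left image is the right image shifted by 3 px.
rng = np.random.default_rng(0)
right_img = rng.random((8, 32)) * 100.0
left_img = np.roll(right_img, 3, axis=1)

cost = build_cost_volume(left_img, right_img, max_disp=8)
disp = soft_argmin(cost, temperature=0.01)
print(np.median(disp[:, 8:]))
```

A small temperature makes the softmax sharp so that the expectation approaches a hard argmin; in a trained network the learned costs play that role instead.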
Results from the two datasets, obtained with the three networks and with OpenCV, are presented. The experiments showed that selecting a proper loss function and learning rate is important when using neural networks, since these choices affect the performance and results of the different networks. The results were evaluated by comparison with the reference depth maps. The advantages and disadvantages of using neural networks and the OpenCV library are analyzed and discussed.
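A comparison against a reference map can be sketched as follows. The abstract does not state which error measures were used, so the metrics below (mean absolute error, RMSE, and the fraction of pixels whose error exceeds a threshold) are assumptions chosen because they are common in stereo benchmarks; the function name and the 3-pixel default threshold are likewise hypothetical.

```python
import numpy as np

def disparity_errors(pred, ref, bad_threshold=3.0):
    """Compare a predicted map with a reference map on valid pixels only.

    Pixels where either map is NaN or infinite (for example, gaps in a
    LiDAR-derived reference) are excluded from all statistics.
    """
    pred = np.asarray(pred, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    valid = np.isfinite(pred) & np.isfinite(ref)
    if not valid.any():
        raise ValueError("no valid pixels to evaluate")
    abs_err = np.abs(pred[valid] - ref[valid])
    return {
        "mae": float(abs_err.mean()),
        "rmse": float(np.sqrt((abs_err ** 2).mean())),
        "bad_rate": float((abs_err > bad_threshold).mean()),
    }

# Toy maps: one reference pixel is missing and is therefore ignored.
ref_map = np.array([[10.0, 20.0],
                    [np.nan, 40.0]])
pred_map = np.array([[11.0, 20.0],
                     [5.0, 45.0]])
metrics = disparity_errors(pred_map, ref_map)
print(metrics)
```

Masking invalid reference pixels matters for LiDAR-derived ground truth, which is typically sparse or has gaps.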