3D INDOOR MAPPING USING AN RGB-D CAMERA SYSTEM
XÂY DỰNG BẢN ĐỒ 3D TRONG NHÀ DÙNG HỆ THỐNG CAMERA RGB-D

Tran Cam Nhan (1), Nguyen Tan Nhu (1), Nguyen Thanh Hai (1)
(1) Faculty of Electrical and Electronics Engineering, HCMC University of Technology and Education

ABSTRACT

The paper presents an application of an RGB-Depth (RGB-D) camera system to 3D indoor mapping. To reconstruct a 3D model of the indoor space where the robot is located, the RGB-D camera system is installed on the robot frame and continuously captures separate 2D image frames. All corresponding 2D points between two consecutive image frames are first estimated using the Scale Invariant Feature Transform (SIFT) algorithm. Second, the pixel coordinates of these matching points are projected into 3D space based on the depth information measured by the RGB-D camera at each pixel. The outputs of this stage are the 3D points of two successive point clouds, which are used to find the transformation matrix between them. Finally, the estimated matrix is used to transform the second point cloud into the coordinate system of the first one. Repeating the above process with the other consecutive pairs of image frames and point clouds yields the 3D model of the navigational space. The proposed configuration is relatively low-cost, and its accuracy is acceptable for indoor robot applications.

Keywords: RGB-D camera system, 3D indoor mapping, SIFT, transformation matrix, point clouds

TÓM TẮT

This paper presents the application of an RGB-D camera system to building a 3D map of an indoor environment. To construct the 3D model of the indoor space in which the robot moves, the RGB-D camera is mounted on the robot frame and captures consecutive 2D images of the environment. The corresponding 2D points of two consecutive image frames are estimated with the SIFT algorithm. The coordinates of each of these 2D correspondences are then converted into 3D coordinates based on the depth measured by the RGB-D camera at each pixel. The 3D coordinates of the points in two consecutive point clouds are used to find the transformation matrix. Finally, this transformation matrix is used to move the second point cloud into the coordinate system of the first one. The result of repeating the above steps with the other consecutive pairs of image frames and point clouds is the 3D model of the space in which the robot moves. The presented solution has a low cost, and its accuracy is acceptable for indoor mobile robot applications.

1. INTRODUCTION

Robotic mapping is one of the most vital tasks in autonomous robotic applications. The robot has to be provided with a model of the navigational space in order to locate itself while moving. Moreover, the map is essential for path planning, which proposes roadmaps to target positions [1].

Robotic mapping is divided into 2D mapping and 3D mapping. 2D mapping has several disadvantages compared with 3D mapping. Examples include the classical application of sonar in mapping and navigation [2], the 2D mapping strategy based on line segmentation [3], and the 2D mapping of a closed area by a range sensor [4]. One of the great drawbacks of these approaches is the lack of information in the third spatial dimension [5].

Recognizing the limitations of 2D planning methods, 3D algorithms have been continuously developed with the support of well-known classical findings. The SLAM algorithm, for instance, has been combined with 3D mapping methods to obtain 3D models of large-scale environments [6]. Consequently, 3D mapping has recently become the main trend in robotic mapping.

The quality of a mapping process is mainly determined by the method of acquiring data from the surroundings. Sonar sensors were first used to obtain the ranges from the robot to surrounding obstacles, which were then used to build a map [2]. The limited accuracy of sonar sensors was compensated by the advantages of 2D laser scanners [7]. Stereo cameras have also been exploited for mapping [8], but the time needed to reconstruct range information from stereo images is costly. The KINECT RGB-D sensor developed by Microsoft in 2012 is considered suitable for robotics [9]. This kind of RGB-D sensor offers both suitable accuracy and fast processing time. These reasons motivate the development of a 3D mapping algorithm supported by an RGB-D camera system.

This paper introduces the process of concatenating discrete 3D point clouds acquired from the surroundings to form a complete 3D model of the navigational space. The KINECT RGB-D sensor is first used to acquire a pair of consecutive images and their corresponding 3D point clouds. Next, the SIFT algorithm [10] is applied to find the corresponding points between the two images. The 3D corresponding points are then obtained by projecting the resulting 2D matching points into 3D space. Finally, two consecutive 3D point clouds are concatenated by transforming the second point cloud into the coordinate system of the first one. Repeating this process with the next pairs of consecutive point clouds makes the 3D model of the navigational space grow larger.
2. METHODS

2.1 The 3D mapping block diagram

Figure 1 presents the procedure for concatenating two consecutive point clouds using the method proposed throughout the paper. After initialization with predefined parameters, the RGB-D camera is employed to capture consecutive images and their corresponding point clouds. Two consecutive point clouds are concatenated based on the matching information between the two successive images. The key points on the first and second images are detected and described by the SIFT algorithm before the correspondences between them can be found. The 2D correspondences are then projected to 3D using a geometric transformation method. These corresponding pairs are used to estimate the transformation matrix, which is applied to bring the second point cloud into the coordinate system of the first point cloud. The next cycle is started after the first pair of point clouds has been united. The final 3D model of the navigational space is, at last, down-sampled to reject overlapping points. The following sections present these steps in detail with some experimental illustrations.

Figure 1. The process of concatenating two consecutive point clouds
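The loop below is a structural sketch of this cycle in Python. It is not the authors' implementation; the helper names (capture_rgbd_frame, match_sift_keypoints, project_matches_to_3d, estimate_transformation, voxel_downsample) are hypothetical stand-ins for the operations described in Sections 2.2 to 2.6 and are passed in as functions so that the skeleton stays self-contained.

```python
# Structural sketch of the mapping loop in Figure 1 (hypothetical helpers).
import numpy as np


def build_3d_map(n_frames, capture_rgbd_frame, match_sift_keypoints,
                 project_matches_to_3d, estimate_transformation,
                 voxel_downsample):
    """Concatenate consecutive point clouds into one model (Section 2.1)."""
    rgb_prev, depth_prev, first_cloud = capture_rgbd_frame()
    global_map = first_cloud            # model kept in the first frame's coordinates
    m_to_first = np.eye(4)              # accumulated transform to frame 0

    for _ in range(n_frames - 1):
        rgb_cur, depth_cur, cloud_cur = capture_rgbd_frame()

        # SIFT detection, description and matching on the two RGB images (2.2, 2.3)
        pts_prev_2d, pts_cur_2d = match_sift_keypoints(rgb_prev, rgb_cur)

        # Project the matched pixels to 3D using the depth images (2.4)
        pts_prev_3d, pts_cur_3d = project_matches_to_3d(
            pts_prev_2d, pts_cur_2d, depth_prev, depth_cur)

        # 4x4 matrix moving the current cloud into the previous frame (2.5)
        m_pair = estimate_transformation(pts_cur_3d, pts_prev_3d)

        # Chain the pairwise matrices back to the first frame (2.6)
        m_to_first = m_pair @ m_to_first

        # Row-vector convention of the paper: p(O0) = p(Ok) x M_k0
        cloud_h = np.hstack([cloud_cur, np.ones((cloud_cur.shape[0], 1))])
        global_map = np.vstack([global_map, (cloud_h @ m_to_first)[:, :3]])

        rgb_prev, depth_prev = rgb_cur, depth_cur

    # Final down-sampling step that rejects overlapping points
    return voxel_downsample(global_map)
```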
2.2 SIFT key point detection

In this research, the SIFT algorithm is applied to detect pixel locations whose positions are independent of scale [10]. Different scales of the original image are first computed by convolving the image with the Gaussian function (1). The scales of the original RGB image are thus represented by L(x, y, σ), estimated by (2), where k is called the scale factor. The extrema of the scale-normalized Laplacian of Gaussian, σ²∇²G, presented in equation (3), are scale-invariant pixel locations on the RGB image of the surroundings; σ²∇²G can also be computed by equation (4).

    G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}    (1)

    L(x, y, \sigma) = I(x, y) * G(x, y, \sigma)    (2)

    \frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}    (3)

    \sigma^2 \nabla^2 G \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k - 1}    (4)

The Laplacian of the Gaussian function, σ²∇²G, is approximately equal to the difference of two consecutive scale versions of L(x, y, σ), as shown in equation (5). The scale-invariant pixel locations are finally the minimum or maximum locations of D(x, y, σ).

    D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)    (5)

2.3 Key point matching

After being located on the 2D image, every key point is described by a 128-dimensional vector. Assume that M(m_1, m_2, m_3, ..., m_128) is a key point feature vector on the first image and N(n_1, n_2, n_3, ..., n_128) is one on the second image. For every key point on the first image, its correspondence on the second image is searched individually. Two key points are considered to be the same if the Euclidean distance between them, given in equation (6), is less than a predetermined constant δ.

    d = \sqrt{(m_1 - n_1)^2 + \cdots + (m_{128} - n_{128})^2}    (6)

Figure 2. An RGB-D data matrix acquired from the KINECT sensor

2.4 2D-to-3D projection

The projection process takes advantage of both the pixel coordinates and the depth value from the RGB-D data. Each RGB-D pixel in a data matrix acquired from the KINECT sensor, as shown in Figure 2, is composed of two classes of information: the Red-Green-Blue color components and the depth value. Every pixel is characterized only by its pixel coordinates (x_P, y_P), so the depth information must be employed to obtain the 3D coordinates of that pixel position. The 2D-to-3D projection is given by equations (7), (8), and (9). As expressed in these equations, an individual 3D point (X_P, Y_P, Z_P) is calculated from the pixel coordinates (x_P, y_P), the depth d_P, and a physical characteristic f of the RGB-D camera, which is estimated during the calibration process. The origin of the 3D coordinate system is placed at the center of the 2D image, so the pixel location is shifted by (x_0, y_0), the center coordinates of the 2D image, as shown in Figure 3.

    X_P = -(x_P - x_0) \times Z_P \times f    (7)

    Y_P = -(y_P - y_0) \times Z_P \times f    (8)

    Z_P = d_P    (9)

Figure 3. A 3D coordinate system on a 2D image
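As an illustration of equations (7) to (9), the following Python sketch projects matched pixels and their depths to 3D points. It is a minimal sketch, not the authors' code; the constant f and the image-center offset (x0, y0) are assumed to come from a prior calibration, and the example values below, including f = 1/525, are placeholders.

```python
# Minimal sketch of the 2D-to-3D projection in equations (7)-(9).
import numpy as np


def project_to_3d(pixels, depths, x0, y0, f):
    """Project pixel coordinates (x_P, y_P) with depth d_P to 3D points.

    pixels : (N, 2) array of (x_P, y_P) pixel coordinates
    depths : (N,) array of depth values d_P at those pixels
    x0, y0 : image-center coordinates (origin of the 3D system)
    f      : camera constant estimated during calibration
    """
    x_p, y_p = pixels[:, 0], pixels[:, 1]
    z_p = depths                          # equation (9): Z_P = d_P
    x_3d = -(x_p - x0) * z_p * f          # equation (7)
    y_3d = -(y_p - y0) * z_p * f          # equation (8)
    return np.column_stack([x_3d, y_3d, z_p])


if __name__ == "__main__":
    # Placeholder values for a 640x480 image; f here is hypothetical.
    pixels = np.array([[320.0, 240.0], [400.0, 260.0]])
    depths = np.array([1.50, 1.62])       # metres
    print(project_to_3d(pixels, depths, x0=320.0, y0=240.0, f=1.0 / 525.0))
```

In practice, matches whose depth cannot be measured (d_P = 0) would be removed before this step, as noted in Section 3.3.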
2.5 3D rotation and transformation

Suppose that there are two sets of 3D points, P_0 and P_1, acquired from the first and second views of the RGB-D camera. The point cloud P_0 is expressed in the first coordinate system O_0, and the point cloud P_1 is expressed in the different coordinate system O_1. The rotation and translation matrix M_10, formed as in equation (10), is responsible for moving the point cloud P_1 into the coordinate system of P_0, as expressed by formulas (11) and (12), where each homogeneous point is written as a row vector and M_10 is the matrix defined in (10).

    M_{10} = \begin{bmatrix} m_{11} & m_{12} & m_{13} & 0 \\ m_{21} & m_{22} & m_{23} & 0 \\ m_{31} & m_{32} & m_{33} & 0 \\ m_{41} & m_{42} & m_{43} & 1 \end{bmatrix}    (10)

    P_0 = P_1 \times M_{10}    (11)

    [X_0^i \; Y_0^i \; Z_0^i \; 1] = [X_1^i \; Y_1^i \; Z_1^i \; 1] \times M_{10}    (12)

Figure 4. The P_1-to-P_0 transformation matrix

2.6 3D point cloud concatenation

All 3D point clouds are concatenated based on the pairwise transformation matrices. Suppose that there are four 3D point clouds, P_0, P_1, P_2, and P_3, expressed in the discrete coordinate systems O_0, O_1, O_2, and O_3 respectively, as shown in Figure 5. To move to the first coordinate system, P_1 is transformed by the matrix M_10, as in equation (13). Applying the same process in equations (14) and (15), the point clouds P_2 and P_3 can also be transformed into the first coordinate system. The matrices M_10, M_21, and M_32 are obtained from the process of concatenating each pair of consecutive point clouds, and the composite matrices M_20 and M_30 are given by equations (16) and (17). The resulting 3D model of the robot's moving space is discussed in the next section of the paper.

    P_1(O_0) = P_1(O_1) \times M_{10}    (13)

    P_2(O_0) = P_2(O_2) \times M_{20}    (14)

    P_3(O_0) = P_3(O_3) \times M_{30}    (15)

    M_{20} = M_{21} \times M_{10}    (16)

    M_{30} = M_{32} \times M_{21} \times M_{10}    (17)

Figure 5. Different 3D point clouds, each with its own coordinate system
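The sketch below illustrates equations (13) to (17) with NumPy, using the row-vector convention above in which a homogeneous point [x y z 1] is multiplied on the left of the 4x4 matrix, so the later pairwise matrix appears on the left in the composite products (16) and (17). It is an illustrative example only; the pairwise matrices M10, M21, and M32 are assumed to have already been estimated.

```python
# Chaining pairwise transformation matrices as in equations (13)-(17).
import numpy as np


def transform_cloud(cloud, m):
    """Apply a 4x4 matrix to an (N, 3) cloud using row-vector homogeneous points."""
    homogeneous = np.hstack([cloud, np.ones((cloud.shape[0], 1))])
    return (homogeneous @ m)[:, :3]


def concatenate_clouds(clouds, pairwise_matrices):
    """Merge clouds P0..Pn into the coordinate system of P0.

    clouds            : list of (N_k, 3) arrays [P0, P1, P2, ...]
    pairwise_matrices : list of 4x4 arrays [M10, M21, M32, ...], where the
                        k-th matrix moves cloud k into the frame of cloud k-1.
    """
    merged = [clouds[0]]
    m_to_first = np.eye(4)
    for cloud, m_pair in zip(clouds[1:], pairwise_matrices):
        # Equations (16)-(17): M20 = M21 x M10, M30 = M32 x M21 x M10, ...
        m_to_first = m_pair @ m_to_first
        # Equations (13)-(15): P_k(O0) = P_k(O_k) x M_k0
        merged.append(transform_cloud(cloud, m_to_first))
    return np.vstack(merged)
```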
3. RESULTS AND DISCUSSIONS

This section presents the results of image and point cloud acquisition and of processing the image frames to produce 3D models.

3.1 Image and point cloud acquisition

The KINECT sensor used as the RGB-D camera is considered the most suitable input device for this work. The data acquired from the KINECT comprise RGB and depth images, whose accuracy is acceptable for indoor robotic applications. Moreover, the depth data are pre-processed by the KINECT hardware, so the higher-level threads can focus on useful tasks. Figure 6 shows two consecutive images captured by the KINECT. For each image, the corresponding 3D point cloud is also taken, giving the two consecutive point clouds shown in Figure 7. The two images and point clouds contain both new and old information; consequently, their combination is expected to cover additional data. The next step of the concatenation process is to estimate the pixel locations of key points in the two images.

Figure 6. Two consecutive RGB images captured from the KINECT sensor

Figure 7. Two consecutive point clouds captured from the KINECT sensor

3.2 2D key point estimation

The SIFT algorithm is applied to locate key points in both the first and second images whose characteristics are independent of scale and rotation. Figures 8 and 9 show these key points, marked with white circles. The key points are mainly concentrated in areas where the differences in gray level are high. Next, each detected key point is described by a 128-dimensional vector according to the SIFT method presented in [10], so that it can easily be recognized with the Euclidean distance of equation (6). Two key points are considered to be matched if the distance between their vectors is less than a predefined constant δ.

Figure 8. Key points on the first 2D image

Figure 9. Key points on the second 2D image

3.3 3D key point matching

After being detected and described by SIFT, the key points on the second image are matched with their correspondences on the first image if the pair of descriptor vectors satisfies the criterion in (6). Figure 10 shows all corresponding key points on the first and second images; each match is presented by a green line connecting two key points. The number of matches is over 200 pairs. Each match is then projected into 3D space using equations (7), (8), and (9). In some cases the depth at a specific pixel of the 2D image cannot be determined because of incorrect reflection, so the matching point at that pixel location has to be eliminated. The remaining matched key points number just over 100 pairs in the case illustrated in Figure 11, where the matches are presented by red lines.

Figure 10. Matching key points between the first and second 2D images

Figure 11. Matching key points between the first and second 3D point clouds
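A minimal OpenCV-based sketch of the detection and matching steps in Sections 3.2 and 3.3 is given below. It assumes the opencv-python package with SIFT available (cv2.SIFT_create); the image file names and the distance threshold are placeholders for the constant δ, and the brute-force matcher with an L2 norm implements the Euclidean criterion of equation (6).

```python
# Sketch of SIFT detection and Euclidean-distance matching (equation (6)).
import cv2
import numpy as np


def detect_and_match(img1_gray, img2_gray, max_distance=250.0):
    """Return matched pixel coordinates between two grayscale images."""
    sift = cv2.SIFT_create()
    kp1, desc1 = sift.detectAndCompute(img1_gray, None)   # 128-D descriptors
    kp2, desc2 = sift.detectAndCompute(img2_gray, None)

    # Brute-force matching with the L2 (Euclidean) norm, as in equation (6)
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(desc1, desc2)

    # Keep only matches whose descriptor distance is below the threshold delta
    good = [m for m in matches if m.distance < max_distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    return pts1, pts2


if __name__ == "__main__":
    # Placeholder file names; replace with two consecutive KINECT RGB frames.
    img1 = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)
    pts1, pts2 = detect_and_match(img1, img2)
    print(f"{len(pts1)} matches kept")
```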
3.4 Pair concatenation

As shown in equation (10), the transformation matrix has a size of 4 x 4, in which 12 parameters are unknown; therefore, at least 12 equations, i.e., at least four pairs of corresponding 3D points, each contributing three equations through the system derived from (12), are required to infer the correct values. In practice, many more correspondences than this minimum are used, because the matches between the two point clouds contain errors. After the transformation matrix is found, it is applied to the second point cloud so that it shares the same coordinate system as the first point cloud. The result of the concatenated point cloud is shown in Figure 12.

Figure 12. The concatenated point cloud
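To illustrate how the 12 unknown parameters can be recovered from the matched 3D points, the following sketch solves the over-determined system from equation (12) by linear least squares, using the row-vector convention and synthetic data. It is one possible way to estimate the matrix, not necessarily the authors' exact solver; the rotation angle and translation below are arbitrary test values.

```python
# Least-squares estimation of the 4x4 transformation matrix from 3D matches.
import numpy as np


def estimate_transformation(pts_second, pts_first):
    """Estimate M such that [x y z 1] @ M maps pts_second onto pts_first.

    pts_second, pts_first : (N, 3) arrays of corresponding 3D points, N >= 4.
    """
    n = pts_second.shape[0]
    a = np.hstack([pts_second, np.ones((n, 1))])   # homogeneous rows, cloud 2
    b = np.hstack([pts_first, np.ones((n, 1))])    # homogeneous rows, cloud 1
    m, *_ = np.linalg.lstsq(a, b, rcond=None)      # solves a @ m ~= b
    return m


if __name__ == "__main__":
    # Synthetic check: rotate and translate a random cloud, then recover M.
    rng = np.random.default_rng(0)
    angle = np.deg2rad(10.0)
    rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                         [np.sin(angle),  np.cos(angle), 0.0],
                         [0.0,            0.0,           1.0]])
    translation = np.array([0.2, -0.1, 0.05])

    cloud_second = rng.uniform(-1.0, 1.0, size=(50, 3))
    cloud_first = cloud_second @ rotation.T + translation   # ground-truth motion

    m_est = estimate_transformation(cloud_second, cloud_first)
    moved = np.hstack([cloud_second, np.ones((50, 1))]) @ m_est
    print("max alignment error:", np.abs(moved[:, :3] - cloud_first).max())
```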
3.5 Consecutive concatenation

Figure 13 shows one frame of the robot's moving space. In this view, some artificial landmarks are placed randomly in the moving space. When the robot moves straight, the resulting 3D model of this space is illustrated in Figures 14 and 15. In the case of a curved path, artificial landmarks are also set statically along the curve, as presented in Figure 16; however, the resulting 3D model is not formed very well, as shown in Figure 17. When the number of landmarks is increased, as shown in Figure 18, the 3D model of the curved space becomes smoother, as presented in Figure 19. The results also suggest that the 3D model of an entire room can be acquired when the robot remains in place and rotates to view the whole room. The 3D model of an entire room is displayed in Figures 20 and 21.

Figure 13. The first frame of the robot's moving space

Figure 14. The first 3D view of the straight moving space

Figure 15. The second 3D view of the straight moving space

Figure 16. The landmarks set on the curving space

Figure 17. The interrupted 3D model of the curving space

Figure 18. The increased number of landmarks

Figure 19. The smoother curving space when increasing the number of landmarks

Figure 20. The first view of a 3D model of the sample room where the robot is rotated
Figure 21. The second view of a 3D model of the sample room where the robot is rotated

4. CONCLUSIONS

In this paper, 3D models of indoor navigational spaces were completely reconstructed from RGB-D image frames obtained with a KINECT system. The SIFT algorithm was employed to detect key-point pixel locations used to build the 3D models of the navigation spaces. The experimental results show that the KINECT system is a lower-cost solution that can replace more expensive stereo cameras, and they demonstrate the effectiveness of the proposed method.

REFERENCES

[1] J. Fuentes-Pacheco, "Visual simultaneous localization and mapping: a survey," Springer Science+Business Media Dordrecht, pp. 55-81, 2015.

[2] A. Elfes, "Sonar-based real-world mapping and navigation," IEEE Journal on Robotics and Automation, vol. 3, pp. 249-265, 1987.

[3] C. P. O. Diaz and A. J. Alvares, "A 2D-mapping strategy based on line segment extraction," in ANDESCON, 2010 IEEE, pp. 1-6, 2010.

[4] S. Ogawa, K. Watanabe, and K. Kobayashi, "2D mapping of a closed area by a range sensor," in Proceedings of the 41st SICE Annual Conference (SICE 2002), vol. 2, pp. 1329-1333, 2002.

[5] W. R. Abdulmajeed and R. Z. M., "Comparison between 2D and 3D mapping for indoor environments," International Journal of Engineering Research and Technology, vol. 2, 2013.

[6] J. L. Cras and J. Paxman, "A modular hybrid SLAM for the 3D mapping of large scale environments," in 12th International Conference on Control Automation Robotics & Vision (ICARCV), pp. 1036-1041, 2012.

[7] L. Hung-Hsing, T. Ching-Chih, H. Ssu-Min, and C. Hsu-Yang, "Automatic mapping for an indoor mobile robot assistant using RFID and laser scanner," in SICE Annual Conference 2007, pp. 2102-2108, 2007.

[8] D. Schleicher, L. M. Bergasa, R. Barea, E. Lopez, and M. Ocana, "Real-time simultaneous localization and mapping using a wide-angle stereo camera," in IEEE Workshop on Distributed Intelligent Systems: Collective Intelligence and Its Applications (DIS 2006), pp. 55-60, 2006.

[9] C. Lim Chee, S. N. Basah, S. Yaacob, M. Y. Din, and Y. E. Juan, "Accuracy and reliability of optimum distance for high performance Kinect sensor," in 2nd International Conference on Biomedical Engineering (ICoBE), pp. 1-7, 2015.

[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91-110, 2004.
Main author contact information
Full name: Nguyễn Thanh Hải
University: Faculty of Electrical and Electronics Engineering, Ho Chi Minh City University of Technology and Education
Mobile phone: (+84) 9ở8806
Email: nthai@hcmute.edu.vn