Scale-dissected Pose Estimation in Visual Odometry

by Rong Yuan

A Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science in the School of Engineering at Brown University

Providence, Rhode Island
May 2017

Copyright © 2017 by Rong Yuan

AUTHORIZATION TO LEND AND REPRODUCE THESIS

As the sole author of this thesis, I authorize Brown University to lend it to other institutions or individuals for the purpose of scholarly research.
Date: _________    Signature: _________    Rong Yuan, Author

I further authorize Brown University to reproduce this thesis by photocopying or other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.
Date: _________    Signature: _________    Rong Yuan, Author

This thesis by Rong Yuan is accepted in its present form by the School of Engineering as satisfying the thesis requirement for the degree of Master of Science.
Date: _________    Benjamin B. Kimia, Advisor, School of Engineering

Approved by the Graduate Council
Date: _________    Andrew G. Campbell, Dean of the Graduate School

Acknowledgements

To my advisor, Prof. Benjamin Kimia, thank you. Thank you for bringing me into the field to which I may devote my future career. It was the best part of my graduate study. Many thanks to all LEMS colleagues; I appreciate every bit of help and encouragement from you. Special gratitude to Hongyi, my partner on BlindFind over the past two years; good luck with the BlindFind project in the future. I am grateful to have had this precious experience in my graduate study and my life. Thank you again to you all.

Contents

1 Introduction
2 Related Works
3 Framework and Formulation
4 Untangling Scale in Pose Estimation
5 Implementation Details
6 Experimental Results
7 Conclusions and Future Works
Bibliography

List of Figures

1.1 BlindFind system overview.
3.1 This figure illustrates the setup: a stereo camera moves along a path C(t) and also rotates to change the viewpoint, so that the pose is completely described by R(t) and T(t). (a) Stereo visual odometry computations are based on a pair of images at times t_i and t_j; we write R_i = R(t_i) and R_j = R(t_j) for short. (b) The indices i and j refer to the coordinate systems at times t_i and t_j. A central coordinate system (X, Y, Z) represents the left and right coordinate systems (X^-, Y^-, Z^-) and (X^+, Y^+, Z^+), respectively. (c) An example of the left and right images at times t_i and t_j.
4.1 The proposed approach determines the relative pose of each pair of adjacent keyframes from matching features, giving (R_{n,n+1}, T̂_{n,n+1}) but leaving a free scale variable λ_{n,n+1} for each pair of frames. This thesis shows how to calculate the scale for adjacent pairs of frames, and that scale-independent estimation of (R_{n,n+1}, T̂_{n,n+1}) from the full set of features improves robustness.
4.2 Direct pose estimation between two frames based on direct matching of features is more accurate than pose estimation based on tracking those features matched from a previous frame. The ground-truth scale is used when comparing PnP to direct estimation, to illustrate the improvement in isolation. This is shown for two distinct tracks.
4.3 The distribution of λ_{n,n+1}/λ_{n-1,n} over all features in the image triple (I_{n-1}, I_n, I_{n+1}), shown in full in (a) and magnified in (b). While the distribution is narrow, there is a range of possible values, and there are some outliers which can easily be identified. Disregarding these outliers, the rest of the distribution can be examined.
4.4 A comparison between the ground-truth relative scale λ_{n,n+1}/λ_{n-1,n} and the relative scale estimated with our method. Our method estimates the correct relative scale between keyframes; the few wrong estimates will propagate.
4.5 (a) The 3D reprojection error is typically computed by first triangulating points and then reprojecting into another view. (b) Alternatively, the point can be estimated directly without triangulation, effectively as the intersection of epipolar lines.
4.6 (a) Stereo visual odometry provides absolute scale and prevents propagation of scale errors. (b), (c) Different ways of establishing absolute scale. (d) Regularization of absolute scale.
4.7 The result on KITTI 00: (a) output path, (b) comparison of rotation error, (c) comparison of translation error. The result on KITTI 07: (d) output path, (e) comparison of rotation error, (f) comparison of translation error.
6.1 A comparison of scale/speed estimation of the PnP-based algorithm (left column) and ours (middle column) against the ground truth. The error distributions are plotted in the right column. Rows correspond to tracks 00 and 07 of KITTI.
6.2 All tracks we experimented on. Left: average translation error. Middle: average rotation error. Right: average speed error.
6.3 The epipolar lines and epipole from the ground-truth pose, which intersect at the center of the image, and the nearly parallel lines from the algorithm's estimate.
6.4 An example of failure mode 2.
6.5 (a) The features extracted from the last frame. (b) After matching, only very few of them remain.
6.6 The images are blurred and not enough features could be tracked.

Chapter 1
Introduction

Visual odometry (VO), the construction of the trajectory of a moving platform from videos captured by one or more cameras mounted on that platform, has become an increasingly important problem in computer vision. This is due to an increasing range of applications: robotics [1], autonomous driving [2], drones and unmanned aerial vehicles (UAVs) [3], and personal navigation. Because visual odometry is affordable and environmentally compatible compared with alternatives such as radar and GPS, a more robust and accurate visual odometry method can lead to significant progress in this area, as well as in a wide range of applications. This thesis aims to solve the odometry problem for BlindFind, a project that aids the visually impaired in navigation with a wearable computer-vision rig.
The focus of BlindFind in this project is indoor navigation, where the system is expected to react to voice commands, plan paths, give feedback, and direct the user to a destination. The device is equipped with a voice recognizer that listens to user commands and a haptic belt that generates feedback by vibrating in a given direction. The core of BlindFind is a visual odometry module, as that is what brings the user to a designated place, Figure 1.1. Early implementations of the visual odometry module relied on running open-source visual odometry libraries, specifically LibViso2 [4], one of the first open-source visual odometry libraries. It uses the classical PnP-based approach, which generates 3D landmarks from the stereo pair in the current frame and minimizes the reprojection error in the next adjacent frame to obtain a relative pose estimate.

The key problem in visual odometry is to estimate the 6-DOF pose (three parameters for rotation and three for translation) from corresponding points in the input image sequences, using the pinhole camera model and projective geometry. Traditionally, as in LibViso2, once correspondences are obtained, the pose parameters of each image are estimated by solving an optimization problem together with an outlier-rejection scheme such as RANSAC. Such methods have reached appreciable performance on several benchmarks. However, the optimization process suffers from numerical issues which make the result insufficiently robust, and the cost of RANSAC or alternative schemes is relatively high. Observe that the monocular technique is able to estimate 5 degrees of freedom, i.e., a rotation and a unit translation vector, fairly accurately. Thus, it is not necessary to solve for rotation, unit translation, and scale all at the same time, which complicates the optimization problem. Rather, we propose a visual odometry approach that dissects the scale estimation from the pose calculation. Instead of optimizing 6 parameters simultaneously, our approach only solves a 1-DOF problem using an analytical outlier-rejection scheme, which gives large improvements in robustness and efficiency. Theoretically, the number of iterations N necessary to guarantee that a correct solution can be computed is

    N = log(1 - p) / log(1 - (1 - ε)^s),

where s is the number of data points in a minimal sample, ε is the fraction of outliers among the data points, and p is the requested probability of success [5]. With the traditional 6-DOF scheme, picking 5 points per sample calls for at least 100 iterations. Our outlier-rejection scheme takes in a single point at a time (s = 1), so the number of iterations decreases dramatically.
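To make the effect of the minimal sample size concrete, the following short sketch evaluates this iteration bound for a 5-point sampler and for a 1-point scheme. It is only an illustration of the formula above; the values of ε and p are illustrative assumptions, not figures from our experiments.

```cpp
#include <cmath>
#include <cstdio>

// Number of RANSAC iterations needed so that, with probability p, at least one
// sample of s points is outlier-free when a fraction eps of the data are outliers [5].
int ransacIterations(int s, double eps, double p) {
    return static_cast<int>(std::ceil(std::log(1.0 - p) /
                                      std::log(1.0 - std::pow(1.0 - eps, s))));
}

int main() {
    const double eps = 0.5, p = 0.99;   // illustrative assumption: 50% outliers
    std::printf("s = 5: %d iterations\n", ransacIterations(5, eps, p));  // roughly 146
    std::printf("s = 1: %d iterations\n", ransacIterations(1, eps, p));  // roughly 7
    return 0;
}
```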
Besides, traditional methods rely heavily on 3D reconstruction, which faces the issue that it is sometimes hard to guarantee that the projection rays from two images intersect at the same point, settling instead for a mid-point solution. In this thesis we propose an approach that uses epipolar geometry to bypass the triangulation procedure, relying instead on a direct estimation in which triangulation is avoided. Experiments show that our scheme produces competitive results when compared with many state-of-the-art methods.

Figure 1.1: BlindFind system overview.

Chapter 2
Related Works

Current visual odometry methods can generally be grouped into three types, based on the data they use. First, 3D-to-3D methods use the 3D world reconstructed from two consecutive image pairs. The input of this type of method comes from a stereo camera [1] or an RGB-D camera [6], which can build a cloud of points for each time stamp. The incremental motion is then computed from the correspondence of 3D points. The performance of this type of method depends heavily on the precision of the 3D points, which motivates alternatives; for example, [6] uses 2D-to-2D correspondences to estimate the quadrifocal tensor. Second, 2D-to-2D methods use the correspondences between images to estimate the motion between two frames; Nistér [7] used his five-point algorithm [8] to solve for the relative rotation and translation up to scale. The key step in this type of method is the estimation of the scale, which cannot be computed from 2D measurements alone. A traditional solution [9] is to lift every point to 3D, calculate the scale for each point, and then use the mean of the scales of all points, which is not stable enough. Neither 3D-to-3D nor 2D-to-2D alone solves the problem perfectly. The 3D-to-2D scheme reaches better performance than the previous two; the idea is to minimize the reprojection error over 3D-to-2D point correspondences. A milestone in this area, [10], computes 3D points from the stereo pair and then minimizes the reprojection error. Based on this scheme, Kitt et al. [4] proposed an approach that adds a RANSAC outlier-rejection scheme. Inspired by the structure-from-motion and multi-view stereo literature, a number of successful SLAM systems [11, 12, 13] use this scheme and achieve very good results by deploying global bundle adjustment. [?] combined feature tracking, pose estimation, and local bundle adjustment, reaching state-of-the-art performance. In all these works, bundle adjustment is a powerful tool to compensate for drift; however, its computational cost is high. Building on the 2D-to-2D idea, [14] solves for the three-dimensional rotation with the five-point algorithm in closed form, and then optimizes the three-dimensional translation by minimizing the reprojection error. This method reaches one of the best performances obtained without bundle adjustment. There have also been other attempts to improve the robustness of the reprojection error: [15, 16, 17, 18] use motion priors to constrain the vehicle motion, which turns the optimization into a lower-dimensional (1D or 2D) problem.

Our method has three main differences from previous works. First, we decouple the 1-DOF scale optimization from the other five dimensions. Second, instead of the reprojection error, we use a trinocular model as the geometric model, which avoids the estimation error introduced by triangulation. Finally, we propose an analytical 1-point RANSAC scheme that improves on the efficiency of the traditional 6-DOF RANSAC scheme.

Chapter 3
Framework and Formulation

Problem Statement. Odometry refers to computing the trajectory of a moving object, e.g., a robot. Visual odometry refers to computing this trajectory from one or more cameras, as opposed to inertial sensors (IMUs). Monocular visual odometry uses one camera for this task, while stereo visual odometry uses a pair of cameras. Monocular visual odometry has the disadvantage of scale ambiguity, i.e., the recovered trajectory is known only up to a scale factor, while stereo odometry results in a unique trajectory with known scale.
In addition, the extra image in stereo odometry provides a second source of information, leading to a more robust result, provided the distance between the two cameras is reasonably large compared to the distance to the objects in view, a well-known result from stereo vision.

Problem Formulation. A stereo camera moves along a trajectory C(t). In our approach, the stereo rig is represented by a "central coordinate system," so that the poses of the right and left cameras with respect to the central coordinate system are represented by (R_0^+, T_0^+) and (R_0^-, T_0^-), respectively. In other words, a point expressed in the central camera coordinate system as Γ_i = (X, Y, Z), in the left camera coordinate system as Γ_i^- = (X^-, Y^-, Z^-), and in the right camera coordinate system as Γ_i^+ = (X^+, Y^+, Z^+), satisfies the following relationships:

    Γ_i = R_0^- Γ_i^- + T_0^-,    Γ_i = R_0^+ Γ_i^+ + T_0^+,    (3.1)
    Γ_i^- = (R_0^-)^{-1} Γ_i - (R_0^-)^{-1} T_0^-,    Γ_i^+ = (R_0^+)^{-1} Γ_i - (R_0^+)^{-1} T_0^+.    (3.2)

Figure 3.1: This figure illustrates the setup: a stereo camera moves along a path C(t) and also rotates to change the viewpoint, so that the pose is completely described by R(t) and T(t). (a) Stereo visual odometry computations are based on a pair of images at times t_i and t_j; we write R_i = R(t_i) and R_j = R(t_j) for short. (b) The indices i and j refer to the coordinate systems at times t_i and t_j. A central coordinate system (X, Y, Z) represents the left and right coordinate systems (X^-, Y^-, Z^-) and (X^+, Y^+, Z^+), respectively. (c) An example of the left and right images at times t_i and t_j.

In our case the stereo cameras are parallel, so that

    R_0^± = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},    T_0^± = \begin{bmatrix} ±d \\ 0 \\ 0 \end{bmatrix},    (3.3)

where 2d is the distance between the two camera centers. For each camera, the relationship between pixel coordinates and metric image coordinates is expressed through the intrinsic matrix K, whose elements are the camera center (ξ_o, η_o) in pixel units, α_ξ, the ratio of the focal length f to the pixel width, α_η, the ratio of the focal length f to the pixel height, and the skew factor s_θ, which we take to be zero. These values are obtained from a calibration step performed with Matlab's camera calibration tools. Let γ = [γ_ξ, γ_η, 1]^T be the homogeneous image coordinate of a pixel (ξ, η); the connection between the two is

    \begin{bmatrix} ξ \\ η \\ 1 \end{bmatrix} = K \begin{bmatrix} γ_ξ \\ γ_η \\ 1 \end{bmatrix},  where  K = \begin{bmatrix} α_ξ & s_θ & ξ_o \\ 0 & α_η & η_o \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} α_ξ & 0 & ξ_o \\ 0 & α_η & η_o \\ 0 & 0 & 1 \end{bmatrix}.    (3.4)

The problem is to recover the camera trajectory by finding the pose of each camera with respect to the world coordinate system, Figure 3.1(a). It is well known that the pose between two cameras can be computed up to scale by finding a sufficient number of corresponding features in the acquired images. The typical visual odometry approach is to find the relative pose between two camera samples in time and to construct and integrate the path over time. Let the relative rotation and translation between the central camera coordinate systems at times t_i and t_j be R_{ij} and T_{ij}, Figure 3.1(b); i.e., the position of a point expressed in the central camera coordinate system (X_i, Y_i, Z_i) at time t_i, Γ_i, is related to its expression in the central camera coordinate system (X_j, Y_j, Z_j) at time t_j, Γ_j, by

    Γ_i = R_{ij} Γ_j + T_{ij},    Γ_j = R_{ij}^{-1} Γ_i - R_{ij}^{-1} T_{ij}.    (3.5)
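As a concrete illustration of these coordinate conventions (a simplified sketch, not the actual system code; the struct and value of d are illustrative), the following Eigen-based snippet maps a point from the left camera frame to the central frame with the parallel-rig parameters of Equation 3.3, and between the central frames at times t_i and t_j via Equation 3.5.

```cpp
#include <Eigen/Dense>

// Rigid transform x_to = R * x_from + T, as in Equations 3.1-3.5.
struct Rigid {
    Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
    Eigen::Vector3d T = Eigen::Vector3d::Zero();
    Eigen::Vector3d apply(const Eigen::Vector3d& x) const { return R * x + T; }
    Eigen::Vector3d applyInverse(const Eigen::Vector3d& x) const {
        return R.transpose() * (x - T);   // R^{-1} = R^T for rotations
    }
};

int main() {
    const double d = 0.25;                  // half of the stereo baseline (illustrative)
    Rigid leftToCentral;   leftToCentral.T  = Eigen::Vector3d(-d, 0, 0);   // (R_0^-, T_0^-)
    Rigid rightToCentral;  rightToCentral.T = Eigen::Vector3d(+d, 0, 0);   // (R_0^+, T_0^+)

    // A point seen in the left camera frame at time t_i ...
    Eigen::Vector3d Gamma_i_left(1.0, 0.5, 8.0);
    // ... expressed in the central frame at t_i (Equation 3.1):
    Eigen::Vector3d Gamma_i = leftToCentral.apply(Gamma_i_left);

    // Relative pose between the central frames at t_i and t_j (Equation 3.5).
    Rigid ij;   // R_ij, T_ij would come from the pose estimation described later.
    Eigen::Vector3d Gamma_j = ij.applyInverse(Gamma_i);   // Γ_j = R_ij^{-1}(Γ_i - T_ij)
    (void)rightToCentral; (void)Gamma_j;
    return 0;
}
```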
Since the only information about the relative pose comes from matching features in the left and right cameras, the computation of (R_{ij}, T_{ij}) is necessarily tied to computing (R_{ij}^±, T_{ij}^±), namely the relative pose of each of the left and right cameras between times t_i and t_j; i.e., a point expressed in the left/right camera coordinate system at time t_i, Γ_i^±, is related to its expression in the left/right camera coordinate system at time t_j, Γ_j^±, by

    Γ_i^± = R_{ij}^± Γ_j^± + T_{ij}^±,    Γ_j^± = (R_{ij}^±)^{-1} Γ_i^± - (R_{ij}^±)^{-1} T_{ij}^±.    (3.6)

Monocular visual odometry based on the left camera gives (R_{ij}^-, T_{ij}^-), and monocular visual odometry based on the right camera gives (R_{ij}^+, T_{ij}^+). The relationship between (R_{ij}, T_{ij}) and (R_{ij}^±, T_{ij}^±) is easily derived. Considering the left camera, for example, (R_{ij}, T_{ij}) can be written as functions of (R_{ij}^-, T_{ij}^-) by first relating Γ_i^- to Γ_i via the first part of Equation 3.1, then Γ_i^- to Γ_j^- via Equation 3.6, and then relating Γ_j to Γ_j^- via the first part of Equation 3.2:

    Γ_i = R_0^- Γ_i^- + T_0^-
        = R_0^- (R_{ij}^- Γ_j^- + T_{ij}^-) + T_0^-
        = R_0^- (R_{ij}^- ((R_0^-)^{-1} Γ_j - (R_0^-)^{-1} T_0^-) + T_{ij}^-) + T_0^-
        = [R_0^- R_{ij}^- (R_0^-)^{-1}] Γ_j + [-R_0^- R_{ij}^- (R_0^-)^{-1} T_0^- + R_0^- T_{ij}^- + T_0^-],    (3.7)

resulting in

    R_{ij} = R_0^- R_{ij}^- (R_0^-)^{-1},    T_{ij} = R_0^- T_{ij}^- + T_0^- - R_0^- R_{ij}^- (R_0^-)^{-1} T_0^-.    (3.8)

Conversely, R_{ij}^- and T_{ij}^- can be written as functions of R_{ij} and T_{ij} as

    R_{ij}^- = (R_0^-)^{-1} R_{ij} R_0^-,    T_{ij}^- = (R_0^-)^{-1} T_{ij} - (R_0^-)^{-1} T_0^- + (R_0^-)^{-1} R_{ij} T_0^-,    (3.9)

and R_{ij}^+ and T_{ij}^+ as functions of R_{ij} and T_{ij} as

    R_{ij}^+ = (R_0^+)^{-1} R_{ij} R_0^+,    T_{ij}^+ = (R_0^+)^{-1} T_{ij} - (R_0^+)^{-1} T_0^+ + (R_0^+)^{-1} R_{ij} T_0^+.    (3.10)

In our case, described in Equation 3.3, R_0^± are identity matrices. Thus,

    R_{ij} = R_{ij}^-,    (3.11)
    T_{ij} = T_{ij}^- + T_0^- - R_{ij} T_0^-,    T_{ij}^- = T_{ij} - T_0^- + R_{ij} T_0^-.    (3.12)

The poses of the right camera are obtained similarly by repeating Equations 3.11 and 3.12 with + instead of -, Figure 3.2.

Figure 3.2

Knowing the relative pose R_{ij}, T_{ij} and the global rotation and translation R_i, C_i of position i, the global pose R_j, C_j is derived as follows. Let Γ_W, Γ_i, Γ_j be the coordinates of a 3D point in the coordinate systems (X, Y, Z), (X_i, Y_i, Z_i), and (X_j, Y_j, Z_j), respectively. Then

    Γ_W = R_i (R_{ij} Γ_j + T_{ij}) + C_i,    (3.13)
    Γ_W = R_j Γ_j + C_j.    (3.14)

Therefore,

    R_j = R_i R_{ij},    (3.15)
    C_j = R_i T_{ij} + C_i.    (3.16)

This is how the current pose of the camera is updated whenever it reaches a new position.
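The pose update of Equations 3.15-3.16 is the integration step of the odometry. A minimal sketch of how a whole trajectory might be accumulated from a chain of relative poses is given below (Eigen-based; the type and function names are illustrative, not the actual ViewTracker interface).

```cpp
#include <Eigen/Dense>
#include <vector>

// Global pose of a frame: rotation R and camera center C in world coordinates.
struct GlobalPose {
    Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
    Eigen::Vector3d C = Eigen::Vector3d::Zero();
};

// Relative pose (R_ij, T_ij) between consecutive frames.
struct RelativePose {
    Eigen::Matrix3d R;
    Eigen::Vector3d T;
};

// Integrate a trajectory from a chain of relative poses, starting at the world
// origin: R_j = R_i * R_ij, C_j = R_i * T_ij + C_i (Equations 3.15, 3.16).
std::vector<GlobalPose> integrateTrajectory(const std::vector<RelativePose>& rel) {
    std::vector<GlobalPose> traj(1);               // first frame: identity pose
    for (const RelativePose& p : rel) {
        const GlobalPose& prev = traj.back();
        GlobalPose next;
        next.R = prev.R * p.R;                     // Equation 3.15
        next.C = prev.R * p.T + prev.C;            // Equation 3.16
        traj.push_back(next);
    }
    return traj;
}
```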
Feature tracking and matching. The estimation of the relative pose between two views I_i and I_j starts from computing feature correspondences. Specifically, SURF and ORB features are used here. Feature extraction provides both the location of a feature and its descriptor, which later serves as the evidence for matching; note that these features are localized to subpixel accuracy. After feature extraction, a two-stage feature matching is performed: i) features are tracked frame by frame using the KLT tracking technique, and ii) for each feature in the left image, the search for a corresponding feature in the right image is restricted to a local patch around the location of the tracked feature, by applying Lowe's ratio test to the subset of features in the patch. In practice, approximately 3000 features are extracted in each frame, and nearly 95% or more of them are tracked to the next adjacent frame in time. In the second phase, applying Lowe's ratio test, a match is validated as a good match if and only if the descriptor distance is the smallest within a 3x3 patch and is smaller than 70% of the distance of the second-ranking match. After this procedure, around 40% of the matched features remain active.

Essential (fundamental) matrix. An essential matrix E, which describes the epipolar geometry between two cameras, can be computed from a number of corresponding features. Observe that the essential matrix is defined in homogeneous image coordinates; in practice, since the features are given in pixel coordinates, computing a fundamental matrix F is more natural, as F is defined in the pixel coordinate system. Note that E and F are strictly related via the camera intrinsic matrix K by E = K^T F K. Therefore, a complete path to obtaining (R, T̂) from feature correspondences is: i) compute the fundamental matrix F; ii) compute the essential matrix E; iii) decompose E into (R, T̂). Details of these three steps are given below.

Let F be the fundamental matrix between images I_i and I_j, and let p_i, p_j be corresponding pixels from I_i and I_j, respectively. Each pair of corresponding features should satisfy

    p_j^T F p_i = 0,    (3.17)

where

    F = \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix}.    (3.18)

Each point correspondence then gives an equation

    \begin{bmatrix} ξ_j & η_j & 1 \end{bmatrix} \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix} \begin{bmatrix} ξ_i \\ η_i \\ 1 \end{bmatrix} = 0,    (3.19)

which, for m corresponding points (ξ_{ki}, η_{ki}) and (ξ_{kj}, η_{kj}), k = 1, 2, ..., m, can be reorganized as

    \begin{bmatrix}
    ξ_{1i}ξ_{1j} & ξ_{1i}η_{1j} & ξ_{1i} & η_{1i}ξ_{1j} & η_{1i}η_{1j} & η_{1i} & ξ_{1j} & η_{1j} & 1 \\
    \vdots & & & & & & & & \vdots \\
    ξ_{mi}ξ_{mj} & ξ_{mi}η_{mj} & ξ_{mi} & η_{mi}ξ_{mj} & η_{mi}η_{mj} & η_{mi} & ξ_{mj} & η_{mj} & 1
    \end{bmatrix}
    \begin{bmatrix} f_{11} \\ f_{21} \\ f_{31} \\ f_{12} \\ f_{22} \\ f_{32} \\ f_{13} \\ f_{23} \\ f_{33} \end{bmatrix} = 0.    (3.20)

A minimum of 8 points is sufficient to solve for the least-squares solution of this system; this is the famous 8-point algorithm. In our case, RANSAC is used to find the fundamental matrix estimate. As it is common practice to extract the pose from the essential matrix rather than the fundamental matrix, an essential matrix is computed from the fundamental matrix by E = K^T F K, where K is the camera intrinsic matrix acquired from camera calibration.

Pose decomposition. The matrix computed by the 8-point algorithm might not be a valid essential matrix, so in practice we project it onto the subspace of essential matrices. The essential matrix E can be written as E = T_× R, where

    T_× = \begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix}.    (3.21)

Observe that T_× is a skew-symmetric matrix, whose singular values are of the form diag(σ, σ, 0). Thus, the singular value decomposition of T_× gives

    T_× = U Σ V^T,    (3.22)

where

    Σ = \begin{bmatrix} σ & 0 & 0 \\ 0 & σ & 0 \\ 0 & 0 & 0 \end{bmatrix}.    (3.23)

Then

    E = U Σ V^T R = U Σ V̄^T,    (3.24)

where V̄ = R^T V, so that Equation 3.24 is a singular value decomposition of E itself. Following Equation 3.24,

    E E^T = U Σ V̄^T V̄ Σ^T U^T = U Σ Σ^T U^T.    (3.25)

Denoting ΣΣ^T by D,

    (E E^T) U = U D    (3.26)

is the eigenvalue equation for E E^T, so U can be recovered from E. Writing the singular value decomposition of the (projected) essential matrix as E = U Σ V^T, the decomposition yields four possible solutions:

    T̂_× = U R_z(±π/2) Σ U^T,    R̂ = U R_z^T(±π/2) V^T.    (3.27)

In practice, to select the correct R and T out of the four choices, we pick the one that gives positive depth to each 2D feature when triangulated to 3D.
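The decomposition of Equation 3.27 and the positive-depth test can be made concrete with a short sketch. The following is a minimal Eigen-based illustration of the standard construction (not the thesis implementation), assuming normalized image coordinates and ignoring degenerate configurations; the convention Γ_2 = R Γ_1 + T̂ is stated in the comments rather than taken from the text.

```cpp
#include <Eigen/Dense>
#include <array>

struct PoseCandidate { Eigen::Matrix3d R; Eigen::Vector3d That; };

// Four candidate decompositions of an essential matrix (cf. Equation 3.27):
// R = U W V^T or U W^T V^T with W = Rz(+pi/2), and T̂ = ± third column of U
// (the null direction of E^T).
std::array<PoseCandidate, 4> decomposeEssential(const Eigen::Matrix3d& E) {
    Eigen::JacobiSVD<Eigen::Matrix3d> svd(E, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d U = svd.matrixU(), V = svd.matrixV();
    Eigen::Matrix3d W;
    W << 0, -1, 0,
         1,  0, 0,
         0,  0, 1;
    Eigen::Matrix3d R1 = U * W * V.transpose();
    Eigen::Matrix3d R2 = U * W.transpose() * V.transpose();
    if (R1.determinant() < 0) R1 = -R1;     // enforce proper rotations
    if (R2.determinant() < 0) R2 = -R2;
    Eigen::Vector3d t = U.col(2);           // unit translation direction, up to sign
    return {PoseCandidate{R1, t}, PoseCandidate{R1, -t},
            PoseCandidate{R2, t}, PoseCandidate{R2, -t}};
}

// Cheirality check for one correspondence (g1, g2) in homogeneous image
// coordinates, under the convention Γ2 = R Γ1 + T̂: solve ρ1 R g1 - ρ2 g2 = -T̂
// in the least-squares sense and require both depths to be positive. The
// candidate passing this test for (most of) the features is kept.
bool positiveDepth(const PoseCandidate& c, const Eigen::Vector3d& g1,
                   const Eigen::Vector3d& g2) {
    Eigen::Matrix<double, 3, 2> A;
    A.col(0) = c.R * g1;
    A.col(1) = -g2;
    Eigen::Vector2d rho = A.colPivHouseholderQr().solve(-c.That);
    return rho(0) > 0.0 && rho(1) > 0.0;
}
```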
Chapter 4
Untangling Scale in Pose Estimation

Let the relative pose of two views I_i and I_j be denoted by the rotation matrix R_{i,j} and the translation vector T_{i,j} = λ_{i,j} T̂_{i,j}, where λ_{i,j} is the magnitude of T_{i,j} and T̂_{i,j} is the unit vector representing its direction. This means that the expression of a 3D point Γ in the coordinate frame of camera i, Γ_i, is related to its expression in the coordinate frame of camera j, Γ_j, as

    Γ_j = R_{i,j} Γ_i + λ_{i,j} T̂_{i,j}.    (4.1)

It is a standard exercise in multiview reconstruction that two images containing a sufficient number of matching features give the essential matrix E_{i,j}, which in turn gives R_{i,j} and T̂_{i,j}, leaving λ_{i,j} as a free parameter. The free parameter implies a family of reconstructions that scale linearly as determined by λ_{i,j} (the metric ambiguity of the reconstruction).

Figure 4.1: The proposed approach determines the relative pose of each pair of adjacent keyframes from matching features, giving (R_{n,n+1}, T̂_{n,n+1}) but leaving a free scale variable λ_{n,n+1} for each pair of frames. This thesis shows how to calculate the scale for adjacent pairs of frames, and that scale-independent estimation of (R_{n,n+1}, T̂_{n,n+1}) from the full set of features improves robustness.

A standard approach to visual odometry matches features between two initial keyframes, say A_1 and A_2 in Figure 4.1, giving both the relative pose (R_{1,2}, T̂_{1,2}) and a non-metric reconstruction in the form of a cloud of 3D points. Features are then tracked from keyframe A_2 through adjacent non-keyframes up to and including keyframe A_3. The correspondence between the tracked features and their 3D reconstruction is used in a standard PnP pose estimation [19] to give (R_{2,3}, λ_{2,3} T̂_{2,3}), with no additional scale ambiguity beyond the initial scale λ_{1,2}. The world is also enriched with new features and the relative pose is adjusted incrementally. The pose of each new frame is determined using numerical optimization to solve for (R_{2,3}, λ_{2,3} T̂_{2,3}), (R_{3,4}, λ_{3,4} T̂_{3,4}), and so on.

Figure 4.2: Direct pose estimation between two frames based on direct matching of features is more accurate than pose estimation based on tracking those features matched from a previous frame. The ground-truth scale is used when comparing PnP to direct estimation, to illustrate the improvement in isolation. This is shown for two distinct tracks.

We observe that the optimization over these six unknowns can be decomposed into estimating (R_{n,n+1}, T̂_{n,n+1}) and estimating λ_{n,n+1}, since (R_{n,n+1}, T̂_{n,n+1}) can be computed directly with the five-point algorithm [8]. What remains is the estimation of a single scale λ_{n,n+1}. This has two distinct advantages. First, the estimation of (R_{n,n+1}, T̂_{n,n+1}) should improve when it is estimated from the large number of features matched between the two keyframes (A_2, A_3), typically around 2000, as opposed to the features matched between (A_1, A_2) that can also be tracked to A_3, say 500. This improvement is demonstrated in Figure 4.2. Second, the independent estimation of (R_{n,n+1}, T̂_{n,n+1}) leaves a single variable λ_{n,n+1} between frames A_n and A_{n+1}, i.e., λ_{1,2}, λ_{2,3}, λ_{3,4}, etc.
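As an illustration of the scale-free part of this decomposition, the sketch below estimates (R_{n,n+1}, T̂_{n,n+1}) from the full set of keyframe-to-keyframe matches with the five-point algorithm, leaving λ_{n,n+1} undetermined. It uses OpenCV's essential-matrix routines (available in OpenCV 3.x and later) as a stand-in for the estimator; it is not the thesis implementation, and the returned unit translation follows the library's sign and direction conventions.

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Scale-independent relative pose between two keyframes from all matched features:
// five-point algorithm inside RANSAC, followed by cheirality-based disambiguation.
// The magnitude of the translation (lambda) is intentionally left undetermined.
bool estimateRotationAndUnitTranslation(const std::vector<cv::Point2f>& ptsA,
                                        const std::vector<cv::Point2f>& ptsB,
                                        const cv::Mat& K,
                                        cv::Mat& R, cv::Mat& tHat) {
    if (ptsA.size() < 5 || ptsA.size() != ptsB.size()) return false;
    cv::Mat inlierMask;
    cv::Mat E = cv::findEssentialMat(ptsA, ptsB, K, cv::RANSAC,
                                     0.999 /*confidence*/, 1.0 /*pixel threshold*/,
                                     inlierMask);
    if (E.empty()) return false;
    // recoverPose performs the four-way decomposition and the positive-depth test.
    int inliers = cv::recoverPose(E, ptsA, ptsB, K, R, tHat, inlierMask);
    return inliers > 0;   // tHat has unit norm; the scale stays free.
}
```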
We now show that the scale between each pair of adjacent frames can be related to that of the previous adjacent pair using only a single tracked feature. Consider three frames I_1, I_2, and I_3, and three corresponding image points γ_1, γ_2, and γ_3, one per frame, all arising from a single 3D point Γ. Let I_i and I_j be related by the relative pose (R_{i,j}, λ_{i,j} T̂_{i,j}) as in Equation 4.1, i, j ∈ {1, 2, 3}, Figure 4.1. Assume (R_{i,j}, T̂_{i,j}) are available but λ_{i,j} is unknown. Then

    λ_{23}/λ_{12} = { [(e_1^T T̂_{21}) - (e_3^T T̂_{21})(e_1^T γ_1)] [(e_3^T R_{23} γ_2)(e_1^T γ_3) - (e_1^T R_{23} γ_2)] }
                  / { [(e_1^T T̂_{23}) - (e_3^T T̂_{23})(e_1^T γ_3)] [(e_3^T R_{21} γ_2)(e_1^T γ_1) - (e_1^T R_{21} γ_2)] },    (4.2)

where e_1^T = [1, 0, 0], e_2^T = [0, 1, 0], and e_3^T = [0, 0, 1]. Let Γ_1 = ρ_1 γ_1 and Γ_2 = ρ_2 γ_2, where ρ_1 and ρ_2 are the depths of the point Γ in cameras 1 and 2, respectively. Then the inner products of the coordinate vectors e_1^T, e_2^T, and e_3^T with Equation 4.1 give

    ρ_2 (e_1^T γ_2) = ρ_1 (e_1^T R_{12} γ_1) + λ_{12} (e_1^T T̂_{12}),
    ρ_2 (e_2^T γ_2) = ρ_1 (e_2^T R_{12} γ_1) + λ_{12} (e_2^T T̂_{12}),
    ρ_2 (e_3^T γ_2) = ρ_1 (e_3^T R_{12} γ_1) + λ_{12} (e_3^T T̂_{12}).    (4.3)

Since e_3^T γ_2 = 1, the last equation relates ρ_2 to ρ_1:

    ρ_2 = (e_3^T R_{12} γ_1) ρ_1 + λ_{12} (e_3^T T̂_{12}).    (4.4)

Substituting this into the first equation of Equations 4.3, we have

    ρ_1 = λ_{12} [(e_1^T T̂_{12}) - (e_3^T T̂_{12})(e_1^T γ_2)] / [(e_3^T R_{12} γ_1)(e_1^T γ_2) - (e_1^T R_{12} γ_1)].    (4.5)

For three images we obtain two expressions of the form of Equation 4.5 for ρ_2, one from the pair (I_2, I_1) and one from the pair (I_2, I_3):

    ρ_2 = λ_{21} [(e_1^T T̂_{21}) - (e_3^T T̂_{21})(e_1^T γ_1)] / [(e_3^T R_{21} γ_2)(e_1^T γ_1) - (e_1^T R_{21} γ_2)],
    ρ_2 = λ_{23} [(e_1^T T̂_{23}) - (e_3^T T̂_{23})(e_1^T γ_3)] / [(e_3^T R_{23} γ_2)(e_1^T γ_3) - (e_1^T R_{23} γ_2)].    (4.6)

Equating the two and using λ_{21} = λ_{12}, the ratio between the two scales is

    λ_{23}/λ_{12} = { [(e_1^T T̂_{21}) - (e_3^T T̂_{21})(e_1^T γ_1)] [(e_3^T R_{23} γ_2)(e_1^T γ_3) - (e_1^T R_{23} γ_2)] }
                  / { [(e_1^T T̂_{23}) - (e_3^T T̂_{23})(e_1^T γ_3)] [(e_3^T R_{21} γ_2)(e_1^T γ_1) - (e_1^T R_{21} γ_2)] }.    (4.7)

Since not just one but numerous triplets of features are available, each provides an estimate of the relative scale, Figure 4.3.

Figure 4.3: The distribution of λ_{n,n+1}/λ_{n-1,n} over all features in the image triple (I_{n-1}, I_n, I_{n+1}), shown in full in (a) and magnified in (b). While the distribution is narrow, there is a range of possible values, and there are some outliers which can easily be identified. Disregarding these outliers, the rest of the distribution can be examined.

One straightforward approach to deciding the relative scale is to take the average or the peak of this distribution. However, there is no clear basis for either of these options. Rather, the optimal relative scale is the one most consistent with the data, excluding outliers. This calls for a RANSAC approach in which the inlier set of each candidate relative scale is measured. Since the range and sampling of the relative scale cannot be known a priori, we use the existing samples of the relative scale as the set of candidates. This one-parameter RANSAC is highly efficient. Figure 4.4 demonstrates the performance of this algorithm in determining the relative scale with respect to the ground truth.

Figure 4.4: A comparison between the ground-truth relative scale λ_{n,n+1}/λ_{n-1,n} and the relative scale estimated with our method. Our method estimates the correct relative scale between keyframes; the few wrong estimates will propagate.
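The per-triplet ratio of Equation 4.7 and the one-parameter consensus step can be written compactly. The sketch below (Eigen-based, with our own function names and an illustrative inlier tolerance) computes the ratio for one tracked feature and then selects the candidate with the largest inlier support; it is a simplified sketch, not the system code.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative-scale ratio lambda_23 / lambda_12 from one feature tracked over three
// frames (Equation 4.7). g1, g2, g3 are homogeneous image coordinates (third
// component 1); (R21, T21hat) and (R23, T23hat) are the scale-free relative poses.
double relativeScaleRatio(const Eigen::Matrix3d& R21, const Eigen::Vector3d& T21hat,
                          const Eigen::Matrix3d& R23, const Eigen::Vector3d& T23hat,
                          const Eigen::Vector3d& g1, const Eigen::Vector3d& g2,
                          const Eigen::Vector3d& g3) {
    const Eigen::Vector3d R21g2 = R21 * g2;
    const Eigen::Vector3d R23g2 = R23 * g2;
    // Depth of the point in camera 2, up to lambda_21 = lambda_12 (Equation 4.6, first line).
    const double num12 = T21hat(0) - T21hat(2) * g1(0);
    const double den12 = R21g2(2) * g1(0) - R21g2(0);
    // Depth of the point in camera 2, up to lambda_23 (Equation 4.6, second line).
    const double num23 = T23hat(0) - T23hat(2) * g3(0);
    const double den23 = R23g2(2) * g3(0) - R23g2(0);
    // Equating the two depth expressions yields Equation 4.7.
    return (num12 * den23) / (num23 * den12);
}

// One-parameter RANSAC: every sample is itself a candidate model; return the
// candidate with the largest inlier set. Assumes a non-empty sample set; the
// tolerance is an illustrative choice, not the value used in the experiments.
double consensusScaleRatio(const std::vector<double>& samples, double tol = 0.02) {
    double best = samples.front();
    std::size_t bestInliers = 0;
    for (double candidate : samples) {
        std::size_t inliers = 0;
        for (double s : samples)
            if (std::abs(s - candidate) <= tol * std::abs(candidate)) ++inliers;
        if (inliers > bestInliers) { bestInliers = inliers; best = candidate; }
    }
    return best;
}
```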
Trinocular reprojection error. The traditional approach to trinocular consistency is to first recover the 3D point by triangulation from two views, then reproject it into a third view and measure the reprojection error. For example, in Figure 4.5, the two points γ_1 and γ_2 triangulate to give Γ, and the reprojection of Γ into a third view gives γ̄_3; the distance between γ_3 and γ̄_3 is the trinocular reprojection error. The difficulty is that the rays through two corresponding points γ_1 and γ_2 rarely meet in 3D, due to calibration and discretization errors, so Γ is instead taken as the point minimizing the distance to these non-intersecting rays. Alternatively, γ̄_3 can be obtained by intersecting the epipolar lines corresponding to γ_1 and γ_2 in the third view. This can be done analytically, without explicitly constructing and intersecting the epipolar lines, as a closed-form expression for γ̄_3 = (ξ̄_3, η̄_3) is available:

    ξ̄_3 = [(a_2 b_1 - a_1 b_2)(ā_3 b̄_1 - ā_1 b̄_3) - (a_3 b_1 - a_1 b_3)(ā_2 b̄_1 - ā_1 b̄_2)] / D,
    η̄_3 = [(a_2 b_3 - a_3 b_2)(ā_2 b̄_1 - ā_1 b̄_2) - (a_2 b_1 - a_1 b_2)(ā_2 b̄_3 - ā_3 b̄_2)] / D,
    D   = (a_2 b_3 - a_3 b_2)(ā_3 b̄_1 - ā_1 b̄_3) - (a_3 b_1 - a_1 b_3)(ā_2 b̄_3 - ā_3 b̄_2),    (4.8)

where a_i = e_i^T R_{1,3} γ_1, b_i = e_i^T T_{1,3}, ā_i = e_i^T R_{2,3} γ_2, and b̄_i = e_i^T T_{2,3}. The formal proof is given in the supplementary material. This avoids the unnecessary compromise in accuracy that results from casting the problem as a 3D reconstruction followed by a projection.

Figure 4.5: (a) The 3D reprojection error is typically computed by first triangulating points and then reprojecting into another view. (b) Alternatively, the point can be estimated directly without triangulation, effectively as the intersection of epipolar lines.
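Equation 4.8 is the intersection of the two epipolar lines in the third view; in code, the same quantity is conveniently formed with cross products, which is algebraically equivalent. The sketch below (Eigen-based, our own function names, degenerate parallel-line configurations not handled) predicts γ̄_3 and evaluates the trinocular reprojection error for one feature.

```cpp
#include <Eigen/Dense>

// Predict the third-view point as the intersection of the epipolar lines induced
// by gamma1 (via R13, T13) and gamma2 (via R23, T23), i.e. Equation 4.8 written
// with cross products. Image points are homogeneous with third component 1, and
// the translations carry a consistent (relative) scale.
Eigen::Vector2d predictThirdView(const Eigen::Matrix3d& R13, const Eigen::Vector3d& T13,
                                 const Eigen::Matrix3d& R23, const Eigen::Vector3d& T23,
                                 const Eigen::Vector3d& gamma1,
                                 const Eigen::Vector3d& gamma2) {
    const Eigen::Vector3d a    = R13 * gamma1;   // a_i    = e_i^T R_{1,3} gamma_1
    const Eigen::Vector3d b    = T13;            // b_i    = e_i^T T_{1,3}
    const Eigen::Vector3d abar = R23 * gamma2;   // abar_i = e_i^T R_{2,3} gamma_2
    const Eigen::Vector3d bbar = T23;            // bbar_i = e_i^T T_{2,3}
    // Epipolar line of gamma1 in view 3: the projections of rho*a + b for all rho.
    const Eigen::Vector3d line1 = a.cross(b);
    // Epipolar line of gamma2 in view 3.
    const Eigen::Vector3d line2 = abar.cross(bbar);
    // Intersection of the two lines in homogeneous coordinates; the division
    // assumes the lines are not parallel (the degenerate case is not handled).
    const Eigen::Vector3d g3 = line1.cross(line2);
    return Eigen::Vector2d(g3(0) / g3(2), g3(1) / g3(2));
}

// Trinocular reprojection error of an observed third-view point gamma3.
double trinocularError(const Eigen::Vector2d& predicted, const Eigen::Vector3d& gamma3) {
    return (predicted - gamma3.head<2>()).norm();
}
```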
Monocular visual odometry. The above method gives the relative scale between two adjacent frame pairs. Given that (R_{n,n+1}, T̂_{n,n+1}) are fully estimated independently, the only unknowns are the scales λ_{n,n+1}. Knowledge of the relative scales λ_{n,n+1}/λ_{n-1,n} reduces the set of unknowns to a single one, e.g., λ_{1,2} = λ, from which all others can be determined in cascade. There are techniques for determining this global scale λ, for example by estimating the ground plane and using the known camera height [20]. Observe, however, that an error at any stage of this sequential process propagates, compounding the error in each subsequent stage.

Stereo visual odometry. Alternatively, when stereo imaging is available, instead of a single frame A_i we have a pair of frames from the left and right cameras, denoted (A_i^-, A_i^+). Observe that the scale of the pair A_i^- and A_i^+ is available from the prior stereo calibration. Using the absolute scale from the stereo pair decomposes scale determination into an independent set of local scale problems. This avoids a global, cascaded scale determination and thus prevents the propagation and compounding of errors. Observe that this can be done in a "forward" manner, Figure 4.6(b), or a "backward" manner, Figure 4.6(c), allowing a regularization of the local scale from the two sources. In addition, the continuity of the absolute scale can be gauged by measuring the relative scale in a sequential fashion, Figure 4.6(d).

Figure 4.6: (a) Stereo visual odometry provides absolute scale and prevents propagation of scale errors. (b), (c) Different ways of establishing absolute scale. (d) Regularization of absolute scale.

Figure 4.7: The result on KITTI 00: (a) output path, (b) comparison of rotation error, (c) comparison of translation error. The result on KITTI 07: (d) output path, (e) comparison of rotation error, (f) comparison of translation error.

Chapter 5
Implementation Details

The system is implemented in C++ and consists of the following major modules; a sketch of how they might fit together follows the module descriptions. Open-source libraries used include OpenCV, g2o, Eigen, and lmmin.

ViewReader: A class that reads in images captured from the cameras and organizes each view as a pair of stereo images.

FeatureExtractor: A class that extracts features according to the specified feature type and computes the descriptor of each feature.

FeatureTracker: Tracks features across frames. The decision of whether a match is good or bad is made here.

ViewTracker: Tracks all frames and landmarks. The 2D-2D and 2D-3D correspondences are maintained here and available for efficient lookup at any time.

PoseEstimator: Implements the different pose estimation methods.

View: A base class that defines a view. A view contains a pair of images, the features in the images, their feature ids, and the pose of the view.

Feature: A base class that defines a feature. A feature contains its type (SURF, ORB, etc.), its descriptor, and the id that relates it to the corresponding 3D landmark.

Converter: A helper class responsible for conversions between different data types. As various open-source libraries are used in the implementation, it is useful to have a converter that bridges the gaps between them.

Canvas: A helper class responsible for all drawing. For instance, it draws feature correspondences connected with lines on the images and outputs them to the user window.
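The following sketch shows one way these modules could be wired into a per-frame loop. The class interfaces are hypothetical stubs (the actual BlindFind classes differ in detail and are not listed with their method signatures in this thesis); the sketch only illustrates the flow of data from image reading through feature tracking to pose estimation.

```cpp
// Hypothetical interfaces with stub bodies; only the data flow is meaningful.
struct StereoView   { /* left/right images, features, feature ids, pose ... */ };
struct RelativePose { /* R, T and the scale handling described in Chapter 4 ... */ };

class ViewReader {
public:
    bool next(StereoView& v) { (void)v; return false; }    // read the next stereo pair
};
class FeatureExtractor {
public:
    void extract(StereoView& v) { (void)v; }                // SURF/ORB features + descriptors
};
class FeatureTracker {
public:
    void track(const StereoView& prev, StereoView& cur) { (void)prev; (void)cur; }  // KLT + ratio test
};
class ViewTracker {
public:
    void add(const StereoView& v) { (void)v; }              // maintain 2D-2D / 2D-3D correspondences
};
class PoseEstimator {
public:
    RelativePose estimate(const StereoView& prev, const StereoView& cur) {
        (void)prev; (void)cur; return RelativePose{};       // scale-dissected estimation
    }
};

void runOdometry(ViewReader& reader, FeatureExtractor& extractor,
                 FeatureTracker& tracker, ViewTracker& views, PoseEstimator& poser) {
    StereoView prev, cur;
    if (!reader.next(prev)) return;
    extractor.extract(prev);
    views.add(prev);
    while (reader.next(cur)) {
        extractor.extract(cur);
        tracker.track(prev, cur);
        RelativePose rel = poser.estimate(prev, cur);   // feeds the update of Eqs. 3.15-3.16
        (void)rel;
        views.add(cur);
        prev = cur;
    }
}
```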
Chapter 6
Experimental Results

Dataset and experiments. Evaluation was performed on the KITTI benchmark [21], whose datasets are captured at 10 Hz by driving around a city, covering urban, rural, and highway scenes. Our method is evaluated on tracks 00-10, with the exception of 01, which does not contain sufficient trackable features, and 04, where no appreciable rotation is available for validation purposes. Some of the results are reported here; others are available in the supplementary materials.

Results and discussion. The KITTI benchmark contains accurate frame-by-frame ground truth for both rotation and translation, and therefore allows approaches to be compared in detail. Note that, to keep the comparison focused, no further refinement such as bundle adjustment or loop closure is applied to either method. Figure 4.7 shows the output paths of the different approaches; in general, dissecting the scale performs much better than the classical PnP method. For more insight, Figure 4.7 also shows the rotation and translation errors of the different approaches, including the variant using the trinocular reprojection error. As the figure illustrates, scale-decoupled pose estimation reduces the error of the classical PnP approach. Moreover, using the trinocular reprojection error instead of the popular triangulation-and-reprojection error gives a slight further improvement. In fact, the trinocular reprojection error performs slightly better on all sequences we experimented on, which supports its robustness over triangulation and reprojection.

Figure 6.1: A comparison of scale/speed estimation of the PnP-based algorithm (left column) and ours (middle column) against the ground truth. The error distributions are plotted in the right column. Rows correspond to tracks 00 and 07 of KITTI.

Figure 6.2: All tracks we experimented on. Left: average translation error. Middle: average rotation error. Right: average speed error.

Failure modes.
1. The estimation of the essential matrix can go wrong. A typical incorrect estimate of the essential matrix E defines an epipolar geometry as illustrated in Figure 6.3. Note that when the camera is moving forward, as in most of the KITTI dataset, the epipole should lie within the image. However, an incorrect estimate sometimes places the epipole off the side of the image. This is exactly what Figure 6.3 shows: the nearly parallel lines intersect at a potential epipole to the left of the image. Such an incorrect pose estimate usually produces a horizontal jump in the output path. Moreover, it can affect the poses of future frames, as the path is integrated frame by frame.
2. The estimation of the relative pose is based on the assumption that most of the features in two consecutive frames move in the same direction and that this direction reflects the real motion of the camera. This is not always true. Consider the situation in which the car halts at a stop sign while other cars pass by horizontally; this violates the assumption and may lead to the estimate of a lateral motion. See Figure 6.4 for an example, where the truck in the images carries most of the features in its lateral motion.
3. Further work is needed to run the odometry on the BlindFind dataset. One of the BlindFind sequences is captured in the Providence City Mall. Although accurate ground truth is not available as it is for KITTI, the track is known to be a loop that closes on itself at the very end, which provides a way to validate the result. At this point, however, the system fails on this dataset. One reason is that at some point the number of features drops too quickly, which may have to do with image quality or blurring. Figure 6.5 shows an example of feature extraction and matching around the failure area. More investigation is called for to determine what happened and resolve it.
4. The BlindFind dataset has another issue that leads to insufficient features: in turning scenarios the images become blurred, as in Figure 6.6.

Figure 6.3: The epipolar lines and epipole from the ground-truth pose, which intersect at the center of the image, and the nearly parallel lines from the algorithm's estimate.

Figure 6.4: An example of failure mode 2.

Figure 6.5: (a) The features extracted from the last frame. (b) After matching, only very few of them remain.

Figure 6.6: The images are blurred and not enough features could be tracked.

Chapter 7
Conclusions and Future Works

Visual odometry has become one of the most popular topics in computer vision research and applications, as it produces very useful information from very low-cost hardware: cameras. The key to visual odometry is estimating the relative pose between two consecutive frames. The contribution of this thesis over classical methods is untangling the scale estimation from the other five degrees of freedom: rotation and unit translation. The preceding chapters show that this approach achieves better results than classical PnP-based methods when computing a relative pose with information limited to a local neighborhood. Much related work, however, brings the problem to a more global scope, using global bundle adjustment and loop-closure techniques to enhance performance. A potential extension of this thesis is therefore a refinement of the pose estimates using information from a larger neighborhood.
The current approach has the disadvantage that a single incorrect estimate of rotation or translation leads to a drift in the output path from that point on. Although scale-dissected pose estimation is shown to be more accurate, it should be given a second chance to correct itself. Consider three consecutive frames I_i, I_j, and I_k, where the relative poses (R_{ij}, T_{ij}) and (R_{jk}, T_{jk}) have already been computed. Since the relative pose between I_i and I_k, (R_{ik}, T_{ik}), can be derived by composing (R_{ij}, T_{ij}) and (R_{jk}, T_{jk}), an optimization can be carried out within these three frames by tweaking the three relative pose estimates slightly to minimize the reprojection errors. One reason this optimization can be expected to work is that the relative pose estimation from the monocular technique has been shown to be very accurate in most cases, so it is very unlikely to see two incorrect estimates among the three; optimizing over them can therefore be expected to refine the poses. In addition, if common features exist in a larger local neighborhood, more than three frames can be included in this optimization scheme.

Furthermore, since at present only the left images are taken into account when estimating a relative pose with the monocular technique, another way to enhance performance is to include the information from the right images as well. One idea is to minimize the reprojection error in both the left and right images when optimizing the relative pose; even a simple average of the two estimates might reduce the error.

Bibliography

[1] Mark Maimone, Yang Cheng, and Larry Matthies. Two years of visual odometry on the Mars Exploration Rovers. Journal of Field Robotics, 24(3):169-186, 2007.

[2] Gim Hee Lee, Friedrich Fraundorfer, and Marc Pollefeys. Motion estimation for self-driving cars with a generalized camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2746-2753, 2013.

[3] Albert S. Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Robotics Research, pages 235-252. Springer, 2017.

[4] Bernd Kitt, Andreas Geiger, and Henning Lategahn. Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In Intelligent Vehicles Symposium (IV), 2010 IEEE, pages 486-492. IEEE, 2010.

[5] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.

[6] Christian Kerl, Jürgen Sturm, and Daniel Cremers. Robust odometry estimation for RGB-D cameras. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 3748-3754. IEEE, 2013.

[7] David Nistér, Oleg Naroditsky, and James Bergen. Visual odometry. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 1. IEEE, 2004.

[8] David Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756-770, 2004.

[9] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[10] Andreas Geiger, Julius Ziegler, and Christoph Stiller. StereoScan: Dense 3d reconstruction in real-time. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 963-968. IEEE, 2011.
[11] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147-1163, 2015.

[12] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225-234. IEEE, 2007.

[13] Alessandro Chiuso, Paolo Favaro, Hailin Jin, and Stefano Soatto. 3-D motion and structure from 2-D motion causally integrated over time: Implementation. In Computer Vision - ECCV 2000, pages 734-750, 2000.

[14] Igor Cvišić and Ivan Petrović. Stereo odometry based on careful feature selection and tracking. In Mobile Robots (ECMR), 2015 European Conference on, pages 1-6. IEEE, 2015.

[15] Davide Scaramuzza, Friedrich Fraundorfer, and Roland Siegwart. Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, pages 4293-4299. IEEE, 2009.

[16] Davide Scaramuzza, Andrea Censi, and Kostas Daniilidis. Exploiting motion priors in visual odometry for vehicle-mounted cameras with non-holonomic constraints. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 4469-4476. IEEE, 2011.

[17] Menglong Zhu, Srikumar Ramalingam, Yuichi Taguchi, and Tyler Garaas. Monocular visual odometry and dense 3D reconstruction for on-road vehicles. In Computer Vision - ECCV 2012. Workshops and Demonstrations, pages 596-606. Springer, 2012.

[18] Yanhua Jiang, Huiyan Chen, Guangming Xiong, and Davide Scaramuzza. ICP stereo visual odometry for wheeled vehicles based on a 1DOF motion prior. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 585-592. IEEE, 2014.

[19] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155-166, 2009.

[20] Bhoram Lee, Kostas Daniilidis, and Daniel D. Lee. Online self-supervised monocular visual odometry for ground vehicles. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 5232-5238. IEEE, 2015.

[21] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354-3361. IEEE, 2012.