Scale-dissected Pose Estimation in Visual Odometry

by Rong Yuan

A Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science in the School of Engineering at Brown University

Providence, Rhode Island
May 2017

Copyright © 2017 by Rong Yuan

AUTHORIZATION TO LEND AND REPRODUCE THESIS

As the sole author of this thesis, I authorize Brown University to lend it to other institutions or individuals for the purpose of scholarly research.
Date: _________    Signature: _________    Rong Yuan, Author

I further authorize Brown University to reproduce this thesis by photocopying or other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.
Date: _________    Signature: _________    Rong Yuan, Author

This thesis by Rong Yuan is accepted in its present form by the School of Engineering as satisfying the thesis requirement for the degree of Master of Science.
Date: _________    Benjamin B. Kimia, Advisor, School of Engineering

Approved by the Graduate Council
Date: _________    Andrew G. Campbell, Dean of the Graduate School

Acknowledgements

To my advisor, Prof. Benjamin Kimia, thank you. Thank you for bringing me into the field to which I may devote my future career. It was the best part of my graduate study. Many thanks to all LEMS colleagues; I appreciate every bit of help and encouragement from you. Special gratitude to Hongyi, my partner on BlindFind over the past two years; good luck with the BlindFind project in the future. I am grateful to have had this precious experience in my graduate study and my life. Thank you again to you all.

Contents

1 Introduction
2 Related Works
3 Framework and Formulation
4 Untangling Scale in Pose Estimation
5 Implementation Details
6 Experimental Results
7 Conclusions and Future Works
Bibliography

List of Figures

1.1 BlindFind system overview.
3.1 This figure illustrates the setup: a stereo camera moves along a path C(t) and also rotates to change the viewpoint, so that the pose is completely described by R(t) and T(t). (a) Stereo visual odometry computations are based on a pair of images at times t_i and t_j; we write R_i = R(t_i) and R_j = R(t_j) for short. (b) The indices i and j refer to the coordinate systems at times t_i and t_j. A central coordinate system (X, Y, Z) represents the left and right coordinate systems (X^-, Y^-, Z^-) and (X^+, Y^+, Z^+), respectively. (c) An example of the left and right images at times t_i and t_j.
4.1 The proposed approach determines the relative pose of each pair of adjacent keyframes from matching features, giving (R_{n,n+1}, T̂_{n,n+1}) but leaving a free scale variable λ_{n,n+1} for each pair of frames. This thesis shows how to calculate the scale for adjacent pairs of frames, and that scale-independent estimation of (R_{n,n+1}, T̂_{n,n+1}) from the full set of features improves robustness.
4.2 Direct pose estimation between two frames based on direct matching of features is more accurate than pose estimation based on tracking those features matched from a previous frame. The ground-truth scale is used when comparing PnP to direct estimation, to illustrate the improvement in isolation. This is shown for two distinct tracks.
4.3 The distribution of λ_{n,n+1}/λ_{n-1,n} over all features in the image triple (I_{n-1}, I_n, I_{n+1}), shown in full in (a) and magnified in (b). While the distribution is narrow, there is a range of possible values, and there are some outliers which can easily be identified. Disregarding these outliers, the rest of the distribution can be examined.
4.4 A comparison between the ground-truth relative scale λ_{n,n+1}/λ_{n-1,n} and the relative scale estimated with our method. Our method estimates the correct relative scale between keyframes; the few wrong estimates will propagate.
4.5 (a) The 3D reprojection error is typically computed by first triangulating points and then reprojecting into another view. (b) Alternatively, the point can be estimated directly without triangulation, effectively as the intersection of epipolar lines.
4.6 (a) Stereo visual odometry provides absolute scale and prevents propagation of scale errors. (b), (c) Different ways of establishing absolute scale. (d) Regularization of absolute scale.
4.7 The result on KITTI 00: (a) output path, (b) comparison of rotation error, (c) comparison of translation error. The result on KITTI 07: (d) output path, (e) comparison of rotation error, (f) comparison of translation error.
6.1 A comparison of scale/speed estimation of the PnP-based algorithm (left column) and ours (middle column) against the ground truth. The error distributions are plotted in the right column. Rows correspond to tracks 00 and 07 of KITTI.
6.2 All tracks we experimented on. Left: average translation error. Middle: average rotation error. Right: average speed error.
6.3 The epipolar lines and epipole from the ground-truth pose, which intersect at the center of the image, and the nearly parallel lines from the algorithm's estimate.
6.4 An example of failure mode 2.
6.5 (a) The features extracted from the last frame. (b) After matching, only very few of them remain.
6.6 The images are blurred and not enough features could be tracked.

Chapter 1
Introduction

Visual odometry (VO), the construction of the trajectory of a moving platform from videos captured by one or more cameras mounted on that platform, has become an increasingly important problem in computer vision. This is due to an increasing range of applications: robotics [1], autonomous driving [2], drones and unmanned aerial vehicles (UAVs) [3], and personal navigation. Because visual odometry is affordable and environmentally compatible compared with alternatives such as radar and GPS, a more robust and accurate visual odometry method can lead to significant progress in this area, as well as in a wide range of applications. This thesis aims to solve the odometry problem for BlindFind, a project that aids the visually impaired in navigation with a wearable computer-vision rig.
The focus of BlindFind in this project is indoor navigation, where the system is expected to react to voice commands, plan paths, give feedback, and direct the user to a destination. The device is equipped with a voice recognizer that listens to user commands and a haptic belt that generates feedback by vibrating in a given direction. The core of BlindFind is a visual odometry module, as that is what brings the user to a designated place, Figure 1.1. Early implementations of the visual odometry module relied on running open-source visual odometry libraries, specifically LibViso2 [4], one of the first open-source visual odometry libraries. It uses the classical PnP-based approach, which generates 3D landmarks from the stereo pair in the current frame and minimizes the reprojection error in the next adjacent frame to obtain a relative pose estimate.

The key problem in visual odometry is to estimate the 6-DOF pose (three parameters for rotation and three for translation) from corresponding points in the input image sequences, using the pinhole camera model and projective geometry. Traditionally, as in LibViso2, once correspondences are obtained, the pose parameters of each image are estimated by solving an optimization problem together with an outlier-rejection scheme such as RANSAC. Such methods have reached appreciable performance on several benchmarks. However, the optimization process suffers from numerical issues which make the result insufficiently robust, and the cost of RANSAC or alternative schemes is relatively high. Observe that the monocular technique is able to estimate 5 degrees of freedom, i.e., a rotation and a unit translation vector, fairly accurately. Thus, it is not necessary to solve for rotation, unit translation, and scale all at the same time, which complicates the optimization problem. Rather, we propose a visual odometry approach that dissects the scale estimation from the pose calculation. Instead of optimizing 6 parameters simultaneously, our approach only solves a 1-DOF problem using an analytical outlier-rejection scheme, which gives large improvements in robustness and efficiency. Theoretically, the number of iterations N necessary to guarantee that a correct solution can be computed is

    N = log(1 - p) / log(1 - (1 - ε)^s),

where s is the number of data points in a minimal sample, ε is the fraction of outliers among the data points, and p is the requested probability of success [5]. With the traditional 6-DOF scheme, picking 5 points per sample calls for at least 100 iterations. Our outlier-rejection scheme takes in a single point at a time (s = 1), so the number of iterations decreases dramatically.
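To make the effect of the minimal sample size concrete, the following short sketch evaluates this iteration bound for a 5-point sampler and for a 1-point scheme. It is only an illustration of the formula above; the values of ε and p are illustrative assumptions, not figures from our experiments.

```cpp
#include <cmath>
#include <cstdio>

// Number of RANSAC iterations needed so that, with probability p, at least one
// sample of s points is outlier-free when a fraction eps of the data are outliers [5].
int ransacIterations(int s, double eps, double p) {
    return static_cast<int>(std::ceil(std::log(1.0 - p) /
                                      std::log(1.0 - std::pow(1.0 - eps, s))));
}

int main() {
    const double eps = 0.5, p = 0.99;   // illustrative assumption: 50% outliers
    std::printf("s = 5: %d iterations\n", ransacIterations(5, eps, p));  // roughly 146
    std::printf("s = 1: %d iterations\n", ransacIterations(1, eps, p));  // roughly 7
    return 0;
}
```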
Besides, traditional methods rely heavily on 3D reconstruction, which faces the issue that it is sometimes hard to guarantee that the projection rays from two images intersect at the same point, settling instead for a mid-point solution. In this thesis we propose an approach that uses epipolar geometry to bypass the triangulation procedure, relying instead on a direct estimation in which triangulation is avoided. Experiments show that our scheme produces competitive results when compared with many state-of-the-art methods.

Figure 1.1: BlindFind system overview.

Chapter 2
Related Works

Current visual odometry methods can generally be grouped into three types, based on the data they use. First, 3D-to-3D methods use the 3D world reconstructed from two consecutive image pairs. The input of this type of method comes from a stereo camera [1] or an RGB-D camera [6], which can build a cloud of points for each time stamp. The incremental motion is then computed from the correspondence of 3D points. The performance of this type of method depends heavily on the precision of the 3D points, which motivates alternatives; for example, [6] uses 2D-to-2D correspondences to estimate the quadrifocal tensor. Second, 2D-to-2D methods use the correspondences between images to estimate the motion between two frames; Nistér [7] used his five-point algorithm [8] to solve for the relative rotation and translation up to scale. The key step in this type of method is the estimation of the scale, which cannot be computed from 2D measurements alone. A traditional solution [9] is to lift every point to 3D, calculate the scale for each point, and then use the mean of the scales of all points, which is not stable enough. Neither 3D-to-3D nor 2D-to-2D alone solves the problem perfectly. The 3D-to-2D scheme reaches better performance than the previous two; the idea is to minimize the reprojection error over 3D-to-2D point correspondences. A milestone in this area, [10], computes 3D points from the stereo pair and then minimizes the reprojection error. Based on this scheme, Kitt et al. [4] proposed an approach that adds a RANSAC outlier-rejection scheme. Inspired by the structure-from-motion and multi-view stereo literature, a number of successful SLAM systems [11, 12, 13] use this scheme and achieve very good results by deploying global bundle adjustment. [?] combined feature tracking, pose estimation, and local bundle adjustment, reaching state-of-the-art performance. In all these works, bundle adjustment is a powerful tool to compensate for drift; however, its computational cost is high. Building on the 2D-to-2D idea, [14] solves for the three-dimensional rotation with the five-point algorithm in closed form, and then optimizes the three-dimensional translation by minimizing the reprojection error. This method reaches one of the best performances obtained without bundle adjustment. There have also been other attempts to improve the robustness of the reprojection error: [15, 16, 17, 18] use motion priors to constrain the vehicle motion, which turns the optimization into a lower-dimensional (1D or 2D) problem.

Our method has three main differences from previous works. First, we decouple the 1-DOF scale optimization from the other five dimensions. Second, instead of the reprojection error, we use a trinocular model as the geometric model, which avoids the estimation error introduced by triangulation. Finally, we propose an analytical 1-point RANSAC scheme that improves on the efficiency of the traditional 6-DOF RANSAC scheme.

Chapter 3
Framework and Formulation

Problem Statement. Odometry refers to computing the trajectory of a moving object, e.g., a robot. Visual odometry refers to computing this trajectory from one or more cameras, as opposed to inertial sensors (IMUs). Monocular visual odometry uses one camera for this task, while stereo visual odometry uses a pair of cameras. Monocular visual odometry has the disadvantage of scale ambiguity, i.e., the recovered trajectory is known only up to a scale factor, while stereo odometry results in a unique trajectory with known scale.
In addition, the extra image in stereo odometry provides a second source of information, leading to a more robust result, provided the distance between the two cameras is reasonably large compared to the distance to the objects in view, a well-known result from stereo vision.

Problem Formulation. A stereo camera moves along a trajectory C(t). In our approach, the stereo rig is represented by a "central coordinate system," so that the poses of the right and left cameras with respect to the central coordinate system are represented by (R_0^+, T_0^+) and (R_0^-, T_0^-), respectively. In other words, a point expressed in the central camera coordinate system as Γ_i = (X, Y, Z), in the left camera coordinate system as Γ_i^- = (X^-, Y^-, Z^-), and in the right camera coordinate system as Γ_i^+ = (X^+, Y^+, Z^+), satisfies the following relationships:

    Γ_i = R_0^- Γ_i^- + T_0^-,    Γ_i = R_0^+ Γ_i^+ + T_0^+,    (3.1)
    Γ_i^- = (R_0^-)^{-1} Γ_i - (R_0^-)^{-1} T_0^-,    Γ_i^+ = (R_0^+)^{-1} Γ_i - (R_0^+)^{-1} T_0^+.    (3.2)

Figure 3.1: This figure illustrates the setup: a stereo camera moves along a path C(t) and also rotates to change the viewpoint, so that the pose is completely described by R(t) and T(t). (a) Stereo visual odometry computations are based on a pair of images at times t_i and t_j; we write R_i = R(t_i) and R_j = R(t_j) for short. (b) The indices i and j refer to the coordinate systems at times t_i and t_j. A central coordinate system (X, Y, Z) represents the left and right coordinate systems (X^-, Y^-, Z^-) and (X^+, Y^+, Z^+), respectively. (c) An example of the left and right images at times t_i and t_j.

In our case the stereo cameras are parallel, so that

    R_0^± = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},    T_0^± = \begin{bmatrix} ±d \\ 0 \\ 0 \end{bmatrix},    (3.3)

where 2d is the distance between the two camera centers. For each camera, the relationship between pixel coordinates and metric image coordinates is expressed through the intrinsic matrix K, whose elements are the camera center (ξ_o, η_o) in pixel units, α_ξ, the ratio of the focal length f to the pixel width, α_η, the ratio of the focal length f to the pixel height, and the skew factor s_θ, which we take to be zero. These values are obtained from a calibration step performed with Matlab's camera calibration tools. Let γ = [γ_ξ, γ_η, 1]^T be the homogeneous image coordinate of a pixel (ξ, η); the connection between the two is

    \begin{bmatrix} ξ \\ η \\ 1 \end{bmatrix} = K \begin{bmatrix} γ_ξ \\ γ_η \\ 1 \end{bmatrix},  where  K = \begin{bmatrix} α_ξ & s_θ & ξ_o \\ 0 & α_η & η_o \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} α_ξ & 0 & ξ_o \\ 0 & α_η & η_o \\ 0 & 0 & 1 \end{bmatrix}.    (3.4)

The problem is to recover the camera trajectory by finding the pose of each camera with respect to the world coordinate system, Figure 3.1(a). It is well known that the pose between two cameras can be computed up to scale by finding a sufficient number of corresponding features in the acquired images. The typical visual odometry approach is to find the relative pose between two camera samples in time and to construct and integrate the path over time. Let the relative rotation and translation between the central camera coordinate systems at times t_i and t_j be R_{ij} and T_{ij}, Figure 3.1(b); i.e., the position of a point expressed in the central camera coordinate system (X_i, Y_i, Z_i) at time t_i, Γ_i, is related to its expression in the central camera coordinate system (X_j, Y_j, Z_j) at time t_j, Γ_j, by

    Γ_i = R_{ij} Γ_j + T_{ij},    Γ_j = R_{ij}^{-1} Γ_i - R_{ij}^{-1} T_{ij}.    (3.5)
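As a concrete illustration of these coordinate conventions (a simplified sketch, not the actual system code; the struct and value of d are illustrative), the following Eigen-based snippet maps a point from the left camera frame to the central frame with the parallel-rig parameters of Equation 3.3, and between the central frames at times t_i and t_j via Equation 3.5.

```cpp
#include <Eigen/Dense>

// Rigid transform x_to = R * x_from + T, as in Equations 3.1-3.5.
struct Rigid {
    Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
    Eigen::Vector3d T = Eigen::Vector3d::Zero();
    Eigen::Vector3d apply(const Eigen::Vector3d& x) const { return R * x + T; }
    Eigen::Vector3d applyInverse(const Eigen::Vector3d& x) const {
        return R.transpose() * (x - T);   // R^{-1} = R^T for rotations
    }
};

int main() {
    const double d = 0.25;                  // half of the stereo baseline (illustrative)
    Rigid leftToCentral;   leftToCentral.T  = Eigen::Vector3d(-d, 0, 0);   // (R_0^-, T_0^-)
    Rigid rightToCentral;  rightToCentral.T = Eigen::Vector3d(+d, 0, 0);   // (R_0^+, T_0^+)

    // A point seen in the left camera frame at time t_i ...
    Eigen::Vector3d Gamma_i_left(1.0, 0.5, 8.0);
    // ... expressed in the central frame at t_i (Equation 3.1):
    Eigen::Vector3d Gamma_i = leftToCentral.apply(Gamma_i_left);

    // Relative pose between the central frames at t_i and t_j (Equation 3.5).
    Rigid ij;   // R_ij, T_ij would come from the pose estimation described later.
    Eigen::Vector3d Gamma_j = ij.applyInverse(Gamma_i);   // Γ_j = R_ij^{-1}(Γ_i - T_ij)
    (void)rightToCentral; (void)Gamma_j;
    return 0;
}
```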
Since the only information about the relative pose comes from matching features in the left and right cameras, the computation of (R_{ij}, T_{ij}) is necessarily tied to computing (R_{ij}^±, T_{ij}^±), namely the relative pose of each of the left and right cameras between times t_i and t_j; i.e., a point expressed in the left/right camera coordinate system at time t_i, Γ_i^±, is related to its expression in the left/right camera coordinate system at time t_j, Γ_j^±, by

    Γ_i^± = R_{ij}^± Γ_j^± + T_{ij}^±,    Γ_j^± = (R_{ij}^±)^{-1} Γ_i^± - (R_{ij}^±)^{-1} T_{ij}^±.    (3.6)

Monocular visual odometry based on the left camera gives (R_{ij}^-, T_{ij}^-), and monocular visual odometry based on the right camera gives (R_{ij}^+, T_{ij}^+). The relationship between (R_{ij}, T_{ij}) and (R_{ij}^±, T_{ij}^±) is easily derived. Considering the left camera, for example, (R_{ij}, T_{ij}) can be written as functions of (R_{ij}^-, T_{ij}^-) by first relating Γ_i^- to Γ_i via the first part of Equation 3.1, then Γ_i^- to Γ_j^- via Equation 3.6, and then relating Γ_j to Γ_j^- via the first part of Equation 3.2:

    Γ_i = R_0^- Γ_i^- + T_0^-
        = R_0^- (R_{ij}^- Γ_j^- + T_{ij}^-) + T_0^-
        = R_0^- (R_{ij}^- ((R_0^-)^{-1} Γ_j - (R_0^-)^{-1} T_0^-) + T_{ij}^-) + T_0^-
        = [R_0^- R_{ij}^- (R_0^-)^{-1}] Γ_j + [-R_0^- R_{ij}^- (R_0^-)^{-1} T_0^- + R_0^- T_{ij}^- + T_0^-],    (3.7)

resulting in

    R_{ij} = R_0^- R_{ij}^- (R_0^-)^{-1},    T_{ij} = R_0^- T_{ij}^- + T_0^- - R_0^- R_{ij}^- (R_0^-)^{-1} T_0^-.    (3.8)

Conversely, R_{ij}^- and T_{ij}^- can be written as functions of R_{ij} and T_{ij} as

    R_{ij}^- = (R_0^-)^{-1} R_{ij} R_0^-,    T_{ij}^- = (R_0^-)^{-1} T_{ij} - (R_0^-)^{-1} T_0^- + (R_0^-)^{-1} R_{ij} T_0^-,    (3.9)

and R_{ij}^+ and T_{ij}^+ as functions of R_{ij} and T_{ij} as

    R_{ij}^+ = (R_0^+)^{-1} R_{ij} R_0^+,    T_{ij}^+ = (R_0^+)^{-1} T_{ij} - (R_0^+)^{-1} T_0^+ + (R_0^+)^{-1} R_{ij} T_0^+.    (3.10)

In our case, described in Equation 3.3, R_0^± are identity matrices. Thus,

    R_{ij} = R_{ij}^-,    (3.11)
    T_{ij} = T_{ij}^- + T_0^- - R_{ij} T_0^-,    T_{ij}^- = T_{ij} - T_0^- + R_{ij} T_0^-.    (3.12)

The poses of the right camera are obtained similarly by repeating Equations 3.11 and 3.12 with + instead of -, Figure 3.2.

Figure 3.2

Knowing the relative pose R_{ij}, T_{ij} and the global rotation and translation R_i, C_i of position i, the global pose R_j, C_j is derived as follows. Let Γ_W, Γ_i, Γ_j be the coordinates of a 3D point in the coordinate systems (X, Y, Z), (X_i, Y_i, Z_i), and (X_j, Y_j, Z_j), respectively. Then

    Γ_W = R_i (R_{ij} Γ_j + T_{ij}) + C_i,    (3.13)
    Γ_W = R_j Γ_j + C_j.    (3.14)

Therefore,

    R_j = R_i R_{ij},    (3.15)
    C_j = R_i T_{ij} + C_i.    (3.16)

This is how the current pose of the camera is updated whenever it reaches a new position.
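The pose update of Equations 3.15-3.16 is the integration step of the odometry. A minimal sketch of how a whole trajectory might be accumulated from a chain of relative poses is given below (Eigen-based; the type and function names are illustrative, not the actual ViewTracker interface).

```cpp
#include <Eigen/Dense>
#include <vector>

// Global pose of a frame: rotation R and camera center C in world coordinates.
struct GlobalPose {
    Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
    Eigen::Vector3d C = Eigen::Vector3d::Zero();
};

// Relative pose (R_ij, T_ij) between consecutive frames.
struct RelativePose {
    Eigen::Matrix3d R;
    Eigen::Vector3d T;
};

// Integrate a trajectory from a chain of relative poses, starting at the world
// origin: R_j = R_i * R_ij, C_j = R_i * T_ij + C_i (Equations 3.15, 3.16).
std::vector<GlobalPose> integrateTrajectory(const std::vector<RelativePose>& rel) {
    std::vector<GlobalPose> traj(1);               // first frame: identity pose
    for (const RelativePose& p : rel) {
        const GlobalPose& prev = traj.back();
        GlobalPose next;
        next.R = prev.R * p.R;                     // Equation 3.15
        next.C = prev.R * p.T + prev.C;            // Equation 3.16
        traj.push_back(next);
    }
    return traj;
}
```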
Feature tracking and matching. The estimation of the relative pose between two views I_i and I_j starts from computing feature correspondences. Specifically, SURF and ORB features are used here. Feature extraction provides both the location of a feature and its descriptor, which later serves as the evidence for matching; note that these features are localized to subpixel accuracy. After feature extraction, a two-stage feature matching is performed: i) features are tracked frame by frame using the KLT tracking technique, and ii) for each feature in the left image, the search for a corresponding feature in the right image is restricted to a local patch around the location of the tracked feature, by applying Lowe's ratio test to the subset of features in the patch. In practice, approximately 3000 features are extracted in each frame, and nearly 95% or more of them are tracked to the next adjacent frame in time. In the second phase, applying Lowe's ratio test, a match is validated as a good match if and only if the descriptor distance is the smallest within a 3x3 patch and is smaller than 70% of the distance of the second-ranking match. After this procedure, around 40% of the matched features remain active.

Essential (fundamental) matrix. An essential matrix E, which describes the epipolar geometry between two cameras, can be computed from a number of corresponding features. Observe that the essential matrix is defined in homogeneous image coordinates; in practice, since the features are given in pixel coordinates, computing a fundamental matrix F is more natural, as F is defined in the pixel coordinate system. Note that E and F are strictly related via the camera intrinsic matrix K by E = K^T F K. Therefore, a complete path to obtaining (R, T̂) from feature correspondences is: i) compute the fundamental matrix F; ii) compute the essential matrix E; iii) decompose E into (R, T̂). Details of these three steps are given below.

Let F be the fundamental matrix between images I_i and I_j, and let p_i, p_j be corresponding pixels from I_i and I_j, respectively. Each pair of corresponding features should satisfy

    p_j^T F p_i = 0,    (3.17)

where

    F = \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix}.    (3.18)

Each point correspondence then gives an equation

    \begin{bmatrix} ξ_j & η_j & 1 \end{bmatrix} \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix} \begin{bmatrix} ξ_i \\ η_i \\ 1 \end{bmatrix} = 0,    (3.19)

which, for m corresponding points (ξ_{ki}, η_{ki}) and (ξ_{kj}, η_{kj}), k = 1, 2, ..., m, can be reorganized as

    \begin{bmatrix}
    ξ_{1i}ξ_{1j} & ξ_{1i}η_{1j} & ξ_{1i} & η_{1i}ξ_{1j} & η_{1i}η_{1j} & η_{1i} & ξ_{1j} & η_{1j} & 1 \\
    \vdots & & & & & & & & \vdots \\
    ξ_{mi}ξ_{mj} & ξ_{mi}η_{mj} & ξ_{mi} & η_{mi}ξ_{mj} & η_{mi}η_{mj} & η_{mi} & ξ_{mj} & η_{mj} & 1
    \end{bmatrix}
    \begin{bmatrix} f_{11} \\ f_{21} \\ f_{31} \\ f_{12} \\ f_{22} \\ f_{32} \\ f_{13} \\ f_{23} \\ f_{33} \end{bmatrix} = 0.    (3.20)

A minimum of 8 points is sufficient to solve for the least-squares solution of this system; this is the famous 8-point algorithm. In our case, RANSAC is used to find the fundamental matrix estimate. As it is common practice to extract the pose from the essential matrix rather than the fundamental matrix, an essential matrix is computed from the fundamental matrix by E = K^T F K, where K is the camera intrinsic matrix acquired from camera calibration.

Pose decomposition. The matrix computed by the 8-point algorithm might not be a valid essential matrix, so in practice we project it onto the subspace of essential matrices. The essential matrix E can be written as E = T_× R, where

    T_× = \begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix}.    (3.21)

Observe that T_× is a skew-symmetric matrix, whose singular values are of the form diag(σ, σ, 0). Thus, the singular value decomposition of T_× gives

    T_× = U Σ V^T,    (3.22)

where

    Σ = \begin{bmatrix} σ & 0 & 0 \\ 0 & σ & 0 \\ 0 & 0 & 0 \end{bmatrix}.    (3.23)

Then

    E = U Σ V^T R = U Σ V̄^T,    (3.24)

where V̄ = R^T V, so that Equation 3.24 is a singular value decomposition of E itself. Following Equation 3.24,

    E E^T = U Σ V̄^T V̄ Σ^T U^T = U Σ Σ^T U^T.    (3.25)

Denoting ΣΣ^T by D,

    (E E^T) U = U D    (3.26)

is the eigenvalue equation for E E^T, so U can be recovered from E. Writing the singular value decomposition of the (projected) essential matrix as E = U Σ V^T, the decomposition yields four possible solutions:

    T̂_× = U R_z(±π/2) Σ U^T,    R̂ = U R_z^T(±π/2) V^T.    (3.27)

In practice, to select the correct R and T out of the four choices, we pick the one that gives positive depth to each 2D feature when triangulated to 3D.
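The decomposition of Equation 3.27 and the positive-depth test can be made concrete with a short sketch. The following is a minimal Eigen-based illustration of the standard construction (not the thesis implementation), assuming normalized image coordinates and ignoring degenerate configurations; the convention Γ_2 = R Γ_1 + T̂ is stated in the comments rather than taken from the text.

```cpp
#include <Eigen/Dense>
#include <array>

struct PoseCandidate { Eigen::Matrix3d R; Eigen::Vector3d That; };

// Four candidate decompositions of an essential matrix (cf. Equation 3.27):
// R = U W V^T or U W^T V^T with W = Rz(+pi/2), and T̂ = ± third column of U
// (the null direction of E^T).
std::array<PoseCandidate, 4> decomposeEssential(const Eigen::Matrix3d& E) {
    Eigen::JacobiSVD<Eigen::Matrix3d> svd(E, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d U = svd.matrixU(), V = svd.matrixV();
    Eigen::Matrix3d W;
    W << 0, -1, 0,
         1,  0, 0,
         0,  0, 1;
    Eigen::Matrix3d R1 = U * W * V.transpose();
    Eigen::Matrix3d R2 = U * W.transpose() * V.transpose();
    if (R1.determinant() < 0) R1 = -R1;     // enforce proper rotations
    if (R2.determinant() < 0) R2 = -R2;
    Eigen::Vector3d t = U.col(2);           // unit translation direction, up to sign
    return {PoseCandidate{R1, t}, PoseCandidate{R1, -t},
            PoseCandidate{R2, t}, PoseCandidate{R2, -t}};
}

// Cheirality check for one correspondence (g1, g2) in homogeneous image
// coordinates, under the convention Γ2 = R Γ1 + T̂: solve ρ1 R g1 - ρ2 g2 = -T̂
// in the least-squares sense and require both depths to be positive. The
// candidate passing this test for (most of) the features is kept.
bool positiveDepth(const PoseCandidate& c, const Eigen::Vector3d& g1,
                   const Eigen::Vector3d& g2) {
    Eigen::Matrix<double, 3, 2> A;
    A.col(0) = c.R * g1;
    A.col(1) = -g2;
    Eigen::Vector2d rho = A.colPivHouseholderQr().solve(-c.That);
    return rho(0) > 0.0 && rho(1) > 0.0;
}
```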
Chapter 4
Untangling Scale in Pose Estimation

Let the relative pose of two views I_i and I_j be denoted by the rotation matrix R_{i,j} and the translation vector T_{i,j} = λ_{i,j} T̂_{i,j}, where λ_{i,j} is the magnitude of T_{i,j} and T̂_{i,j} is the unit vector representing its direction. This means that the expression of a 3D point Γ in the coordinate frame of camera i, Γ_i, is related to its expression in the coordinate frame of camera j, Γ_j, as

    Γ_j = R_{i,j} Γ_i + λ_{i,j} T̂_{i,j}.    (4.1)

It is a standard exercise in multiview reconstruction that two images containing a sufficient number of matching features give the essential matrix E_{i,j}, which in turn gives R_{i,j} and T̂_{i,j}, leaving λ_{i,j} as a free parameter. The free parameter implies a family of reconstructions that scale linearly as determined by λ_{i,j} (the metric ambiguity of the reconstruction).

Figure 4.1: The proposed approach determines the relative pose of each pair of adjacent keyframes from matching features, giving (R_{n,n+1}, T̂_{n,n+1}) but leaving a free scale variable λ_{n,n+1} for each pair of frames. This thesis shows how to calculate the scale for adjacent pairs of frames, and that scale-independent estimation of (R_{n,n+1}, T̂_{n,n+1}) from the full set of features improves robustness.

A standard approach to visual odometry matches features between two initial keyframes, say A_1 and A_2 in Figure 4.1, giving both the relative pose (R_{1,2}, T̂_{1,2}) and a non-metric reconstruction in the form of a cloud of 3D points. Features are then tracked from keyframe A_2 through adjacent non-keyframes up to and including keyframe A_3. The correspondence between the tracked features and their 3D reconstruction is used in a standard PnP pose estimation [19] to give (R_{2,3}, λ_{2,3} T̂_{2,3}), with no additional scale ambiguity beyond the initial scale λ_{1,2}. The world is also enriched with new features and the relative pose is adjusted incrementally. The pose of each new frame is determined using numerical optimization to solve for (R_{2,3}, λ_{2,3} T̂_{2,3}), (R_{3,4}, λ_{3,4} T̂_{3,4}), and so on.

Figure 4.2: Direct pose estimation between two frames based on direct matching of features is more accurate than pose estimation based on tracking those features matched from a previous frame. The ground-truth scale is used when comparing PnP to direct estimation, to illustrate the improvement in isolation. This is shown for two distinct tracks.

We observe that the optimization over these six unknowns can be decomposed into estimating (R_{n,n+1}, T̂_{n,n+1}) and estimating λ_{n,n+1}, since (R_{n,n+1}, T̂_{n,n+1}) can be computed directly with the five-point algorithm [8]. What remains is the estimation of a single scale λ_{n,n+1}. This has two distinct advantages. First, the estimation of (R_{n,n+1}, T̂_{n,n+1}) should improve when it is estimated from the large number of features matched between the two keyframes (A_2, A_3), typically around 2000, as opposed to the features matched between (A_1, A_2) that can also be tracked to A_3, say 500. This improvement is demonstrated in Figure 4.2. Second, the independent estimation of (R_{n,n+1}, T̂_{n,n+1}) leaves a single variable λ_{n,n+1} between frames A_n and A_{n+1}, i.e., λ_{1,2}, λ_{2,3}, λ_{3,4}, etc.
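As an illustration of the scale-free part of this decomposition, the sketch below estimates (R_{n,n+1}, T̂_{n,n+1}) from the full set of keyframe-to-keyframe matches with the five-point algorithm, leaving λ_{n,n+1} undetermined. It uses OpenCV's essential-matrix routines (available in OpenCV 3.x and later) as a stand-in for the estimator; it is not the thesis implementation, and the returned unit translation follows the library's sign and direction conventions.

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Scale-independent relative pose between two keyframes from all matched features:
// five-point algorithm inside RANSAC, followed by cheirality-based disambiguation.
// The magnitude of the translation (lambda) is intentionally left undetermined.
bool estimateRotationAndUnitTranslation(const std::vector<cv::Point2f>& ptsA,
                                        const std::vector<cv::Point2f>& ptsB,
                                        const cv::Mat& K,
                                        cv::Mat& R, cv::Mat& tHat) {
    if (ptsA.size() < 5 || ptsA.size() != ptsB.size()) return false;
    cv::Mat inlierMask;
    cv::Mat E = cv::findEssentialMat(ptsA, ptsB, K, cv::RANSAC,
                                     0.999 /*confidence*/, 1.0 /*pixel threshold*/,
                                     inlierMask);
    if (E.empty()) return false;
    // recoverPose performs the four-way decomposition and the positive-depth test.
    int inliers = cv::recoverPose(E, ptsA, ptsB, K, R, tHat, inlierMask);
    return inliers > 0;   // tHat has unit norm; the scale stays free.
}
```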
We now show that the scale between each pair of adjacent frames can be related to that of the previous adjacent pair using only a single tracked feature. Consider three frames I_1, I_2, and I_3, and three corresponding image points γ_1, γ_2, and γ_3, one per frame, all arising from a single 3D point Γ. Let I_i and I_j be related by the relative pose (R_{i,j}, λ_{i,j} T̂_{i,j}) as in Equation 4.1, i, j ∈ {1, 2, 3}, Figure 4.1. Assume (R_{i,j}, T̂_{i,j}) are available but λ_{i,j} is unknown. Then

    λ_{23}/λ_{12} = { [(e_1^T T̂_{21}) - (e_3^T T̂_{21})(e_1^T γ_1)] [(e_3^T R_{23} γ_2)(e_1^T γ_3) - (e_1^T R_{23} γ_2)] }
                  / { [(e_1^T T̂_{23}) - (e_3^T T̂_{23})(e_1^T γ_3)] [(e_3^T R_{21} γ_2)(e_1^T γ_1) - (e_1^T R_{21} γ_2)] },    (4.2)

where e_1^T = [1, 0, 0], e_2^T = [0, 1, 0], and e_3^T = [0, 0, 1]. Let Γ_1 = ρ_1 γ_1 and Γ_2 = ρ_2 γ_2, where ρ_1 and ρ_2 are the depths of the point Γ in cameras 1 and 2, respectively. Then the inner products of the coordinate vectors e_1^T, e_2^T, and e_3^T with Equation 4.1 give

    ρ_2 (e_1^T γ_2) = ρ_1 (e_1^T R_{12} γ_1) + λ_{12} (e_1^T T̂_{12}),
    ρ_2 (e_2^T γ_2) = ρ_1 (e_2^T R_{12} γ_1) + λ_{12} (e_2^T T̂_{12}),
    ρ_2 (e_3^T γ_2) = ρ_1 (e_3^T R_{12} γ_1) + λ_{12} (e_3^T T̂_{12}).    (4.3)

Since e_3^T γ_2 = 1, the last equation relates ρ_2 to ρ_1:

    ρ_2 = (e_3^T R_{12} γ_1) ρ_1 + λ_{12} (e_3^T T̂_{12}).    (4.4)

Substituting this into the first equation of Equations 4.3, we have

    ρ_1 = λ_{12} [(e_1^T T̂_{12}) - (e_3^T T̂_{12})(e_1^T γ_2)] / [(e_3^T R_{12} γ_1)(e_1^T γ_2) - (e_1^T R_{12} γ_1)].    (4.5)

For three images we obtain two expressions of the form of Equation 4.5 for ρ_2, one from the pair (I_2, I_1) and one from the pair (I_2, I_3):

    ρ_2 = λ_{21} [(e_1^T T̂_{21}) - (e_3^T T̂_{21})(e_1^T γ_1)] / [(e_3^T R_{21} γ_2)(e_1^T γ_1) - (e_1^T R_{21} γ_2)],
    ρ_2 = λ_{23} [(e_1^T T̂_{23}) - (e_3^T T̂_{23})(e_1^T γ_3)] / [(e_3^T R_{23} γ_2)(e_1^T γ_3) - (e_1^T R_{23} γ_2)].    (4.6)

Equating the two and using λ_{21} = λ_{12}, the ratio between the two scales is

    λ_{23}/λ_{12} = { [(e_1^T T̂_{21}) - (e_3^T T̂_{21})(e_1^T γ_1)] [(e_3^T R_{23} γ_2)(e_1^T γ_3) - (e_1^T R_{23} γ_2)] }
                  / { [(e_1^T T̂_{23}) - (e_3^T T̂_{23})(e_1^T γ_3)] [(e_3^T R_{21} γ_2)(e_1^T γ_1) - (e_1^T R_{21} γ_2)] }.    (4.7)

Since not just one but numerous triplets of features are available, each provides an estimate of the relative scale, Figure 4.3.

Figure 4.3: The distribution of λ_{n,n+1}/λ_{n-1,n} over all features in the image triple (I_{n-1}, I_n, I_{n+1}), shown in full in (a) and magnified in (b). While the distribution is narrow, there is a range of possible values, and there are some outliers which can easily be identified. Disregarding these outliers, the rest of the distribution can be examined.

One straightforward approach to deciding the relative scale is to take the average or the peak of this distribution. However, there is no clear basis for either of these options. Rather, the optimal relative scale is the one most consistent with the data, excluding outliers. This calls for a RANSAC approach in which the inlier set of each candidate relative scale is measured. Since the range and sampling of the relative scale cannot be known a priori, we use the existing samples of the relative scale as the set of candidates. This one-parameter RANSAC is highly efficient. Figure 4.4 demonstrates the performance of this algorithm in determining the relative scale with respect to the ground truth.

Figure 4.4: A comparison between the ground-truth relative scale λ_{n,n+1}/λ_{n-1,n} and the relative scale estimated with our method. Our method estimates the correct relative scale between keyframes; the few wrong estimates will propagate.
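The per-triplet ratio of Equation 4.7 and the one-parameter consensus step can be written compactly. The sketch below (Eigen-based, with our own function names and an illustrative inlier tolerance) computes the ratio for one tracked feature and then selects the candidate with the largest inlier support; it is a simplified sketch, not the system code.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative-scale ratio lambda_23 / lambda_12 from one feature tracked over three
// frames (Equation 4.7). g1, g2, g3 are homogeneous image coordinates (third
// component 1); (R21, T21hat) and (R23, T23hat) are the scale-free relative poses.
double relativeScaleRatio(const Eigen::Matrix3d& R21, const Eigen::Vector3d& T21hat,
                          const Eigen::Matrix3d& R23, const Eigen::Vector3d& T23hat,
                          const Eigen::Vector3d& g1, const Eigen::Vector3d& g2,
                          const Eigen::Vector3d& g3) {
    const Eigen::Vector3d R21g2 = R21 * g2;
    const Eigen::Vector3d R23g2 = R23 * g2;
    // Depth of the point in camera 2, up to lambda_21 = lambda_12 (Equation 4.6, first line).
    const double num12 = T21hat(0) - T21hat(2) * g1(0);
    const double den12 = R21g2(2) * g1(0) - R21g2(0);
    // Depth of the point in camera 2, up to lambda_23 (Equation 4.6, second line).
    const double num23 = T23hat(0) - T23hat(2) * g3(0);
    const double den23 = R23g2(2) * g3(0) - R23g2(0);
    // Equating the two depth expressions yields Equation 4.7.
    return (num12 * den23) / (num23 * den12);
}

// One-parameter RANSAC: every sample is itself a candidate model; return the
// candidate with the largest inlier set. Assumes a non-empty sample set; the
// tolerance is an illustrative choice, not the value used in the experiments.
double consensusScaleRatio(const std::vector<double>& samples, double tol = 0.02) {
    double best = samples.front();
    std::size_t bestInliers = 0;
    for (double candidate : samples) {
        std::size_t inliers = 0;
        for (double s : samples)
            if (std::abs(s - candidate) <= tol * std::abs(candidate)) ++inliers;
        if (inliers > bestInliers) { bestInliers = inliers; best = candidate; }
    }
    return best;
}
```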
Trinocular reprojection error. The traditional approach to trinocular consistency is to first recover the 3D point by triangulation from two views, then reproject it into a third view and measure the reprojection error. For example, in Figure 4.5, the two points γ_1 and γ_2 triangulate to give Γ, and the reprojection of Γ into a third view gives γ̄_3; the distance between γ_3 and γ̄_3 is the trinocular reprojection error. The difficulty is that the rays through two corresponding points γ_1 and γ_2 rarely meet in 3D, due to calibration and discretization errors, so Γ is instead taken as the point minimizing the distance to these non-intersecting rays. Alternatively, γ̄_3 can be obtained by intersecting the epipolar lines corresponding to γ_1 and γ_2 in the third view. This can be done analytically, without explicitly constructing and intersecting the epipolar lines, as a closed-form expression for γ̄_3 = (ξ̄_3, η̄_3) is available:

    ξ̄_3 = [(a_2 b_1 - a_1 b_2)(ā_3 b̄_1 - ā_1 b̄_3) - (a_3 b_1 - a_1 b_3)(ā_2 b̄_1 - ā_1 b̄_2)] / D,
    η̄_3 = [(a_2 b_3 - a_3 b_2)(ā_2 b̄_1 - ā_1 b̄_2) - (a_2 b_1 - a_1 b_2)(ā_2 b̄_3 - ā_3 b̄_2)] / D,
    D   = (a_2 b_3 - a_3 b_2)(ā_3 b̄_1 - ā_1 b̄_3) - (a_3 b_1 - a_1 b_3)(ā_2 b̄_3 - ā_3 b̄_2),    (4.8)

where a_i = e_i^T R_{1,3} γ_1, b_i = e_i^T T_{1,3}, ā_i = e_i^T R_{2,3} γ_2, and b̄_i = e_i^T T_{2,3}. The formal proof is given in the supplementary material. This avoids the unnecessary compromise in accuracy that results from casting the problem as a 3D reconstruction followed by a projection.

Figure 4.5: (a) The 3D reprojection error is typically computed by first triangulating points and then reprojecting into another view. (b) Alternatively, the point can be estimated directly without triangulation, effectively as the intersection of epipolar lines.
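Equation 4.8 is the intersection of the two epipolar lines in the third view; in code, the same quantity is conveniently formed with cross products, which is algebraically equivalent. The sketch below (Eigen-based, our own function names, degenerate parallel-line configurations not handled) predicts γ̄_3 and evaluates the trinocular reprojection error for one feature.

```cpp
#include <Eigen/Dense>

// Predict the third-view point as the intersection of the epipolar lines induced
// by gamma1 (via R13, T13) and gamma2 (via R23, T23), i.e. Equation 4.8 written
// with cross products. Image points are homogeneous with third component 1, and
// the translations carry a consistent (relative) scale.
Eigen::Vector2d predictThirdView(const Eigen::Matrix3d& R13, const Eigen::Vector3d& T13,
                                 const Eigen::Matrix3d& R23, const Eigen::Vector3d& T23,
                                 const Eigen::Vector3d& gamma1,
                                 const Eigen::Vector3d& gamma2) {
    const Eigen::Vector3d a    = R13 * gamma1;   // a_i    = e_i^T R_{1,3} gamma_1
    const Eigen::Vector3d b    = T13;            // b_i    = e_i^T T_{1,3}
    const Eigen::Vector3d abar = R23 * gamma2;   // abar_i = e_i^T R_{2,3} gamma_2
    const Eigen::Vector3d bbar = T23;            // bbar_i = e_i^T T_{2,3}
    // Epipolar line of gamma1 in view 3: the projections of rho*a + b for all rho.
    const Eigen::Vector3d line1 = a.cross(b);
    // Epipolar line of gamma2 in view 3.
    const Eigen::Vector3d line2 = abar.cross(bbar);
    // Intersection of the two lines in homogeneous coordinates; the division
    // assumes the lines are not parallel (the degenerate case is not handled).
    const Eigen::Vector3d g3 = line1.cross(line2);
    return Eigen::Vector2d(g3(0) / g3(2), g3(1) / g3(2));
}

// Trinocular reprojection error of an observed third-view point gamma3.
double trinocularError(const Eigen::Vector2d& predicted, const Eigen::Vector3d& gamma3) {
    return (predicted - gamma3.head<2>()).norm();
}
```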
Monocular visual odometry. The above method gives the relative scale between two adjacent frame pairs. Given that (R_{n,n+1}, T̂_{n,n+1}) are fully estimated independently, the only unknowns are the scales λ_{n,n+1}. Knowledge of the relative scales λ_{n,n+1}/λ_{n-1,n} reduces the set of unknowns to a single one, e.g., λ_{1,2} = λ, from which all others can be determined in cascade. There are techniques for determining this global scale λ, for example by estimating the ground plane and using the known camera height [20]. Observe, however, that an error at any stage of this sequential process propagates, compounding the error in each subsequent stage.

Stereo visual odometry. Alternatively, when stereo imaging is available, instead of a single frame A_i we have a pair of frames from the left and right cameras, denoted (A_i^-, A_i^+). Observe that the scale of the pair A_i^- and A_i^+ is available from the prior stereo calibration. Using the absolute scale from the stereo pair decomposes scale determination into an independent set of local scale problems. This avoids a global, cascaded scale determination and thus prevents the propagation and compounding of errors. Observe that this can be done in a "forward" manner, Figure 4.6(b), or a "backward" manner, Figure 4.6(c), allowing a regularization of the local scale from the two sources. In addition, the continuity of the absolute scale can be gauged by measuring the relative scale in a sequential fashion, Figure 4.6(d).

Figure 4.6: (a) Stereo visual odometry provides absolute scale and prevents propagation of scale errors. (b), (c) Different ways of establishing absolute scale. (d) Regularization of absolute scale.

Figure 4.7: The result on KITTI 00: (a) output path, (b) comparison of rotation error, (c) comparison of translation error. The result on KITTI 07: (d) output path, (e) comparison of rotation error, (f) comparison of translation error.

Chapter 5
Implementation Details

The system is implemented in C++ and consists of the following major modules; a sketch of how they might fit together follows the module descriptions. Open-source libraries used include OpenCV, g2o, Eigen, and lmmin.

ViewReader: A class that reads in images captured from the cameras and organizes each view as a pair of stereo images.

FeatureExtractor: A class that extracts features according to the specified feature type and computes the descriptor of each feature.

FeatureTracker: Tracks features across frames. The decision of whether a match is good or bad is made here.

ViewTracker: Tracks all frames and landmarks. The 2D-2D and 2D-3D correspondences are maintained here and available for efficient lookup at any time.

PoseEstimator: Implements the different pose estimation methods.

View: A base class that defines a view. A view contains a pair of images, the features in the images, their feature ids, and the pose of the view.

Feature: A base class that defines a feature. A feature contains its type (SURF, ORB, etc.), its descriptor, and the id that relates it to the corresponding 3D landmark.

Converter: A helper class responsible for conversions between different data types. As various open-source libraries are used in the implementation, it is useful to have a converter that bridges the gaps between them.

Canvas: A helper class responsible for all drawing. For instance, it draws feature correspondences connected with lines on the images and outputs them to the user window.
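The following sketch shows one way these modules could be wired into a per-frame loop. The class interfaces are hypothetical stubs (the actual BlindFind classes differ in detail and are not listed with their method signatures in this thesis); the sketch only illustrates the flow of data from image reading through feature tracking to pose estimation.

```cpp
// Hypothetical interfaces with stub bodies; only the data flow is meaningful.
struct StereoView   { /* left/right images, features, feature ids, pose ... */ };
struct RelativePose { /* R, T and the scale handling described in Chapter 4 ... */ };

class ViewReader {
public:
    bool next(StereoView& v) { (void)v; return false; }    // read the next stereo pair
};
class FeatureExtractor {
public:
    void extract(StereoView& v) { (void)v; }                // SURF/ORB features + descriptors
};
class FeatureTracker {
public:
    void track(const StereoView& prev, StereoView& cur) { (void)prev; (void)cur; }  // KLT + ratio test
};
class ViewTracker {
public:
    void add(const StereoView& v) { (void)v; }              // maintain 2D-2D / 2D-3D correspondences
};
class PoseEstimator {
public:
    RelativePose estimate(const StereoView& prev, const StereoView& cur) {
        (void)prev; (void)cur; return RelativePose{};       // scale-dissected estimation
    }
};

void runOdometry(ViewReader& reader, FeatureExtractor& extractor,
                 FeatureTracker& tracker, ViewTracker& views, PoseEstimator& poser) {
    StereoView prev, cur;
    if (!reader.next(prev)) return;
    extractor.extract(prev);
    views.add(prev);
    while (reader.next(cur)) {
        extractor.extract(cur);
        tracker.track(prev, cur);
        RelativePose rel = poser.estimate(prev, cur);   // feeds the update of Eqs. 3.15-3.16
        (void)rel;
        views.add(cur);
        prev = cur;
    }
}
```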
Chapter 6
Experimental Results

Dataset and experiments. Evaluation was performed on the KITTI benchmark [21], whose datasets are captured at 10 Hz by driving around a city, covering urban, rural, and highway scenes. Our method is evaluated on tracks 00-10, with the exception of 01, which does not contain sufficient trackable features, and 04, where no appreciable rotation is available for validation purposes. Some of the results are reported here; others are available in the supplementary materials.

Results and discussion. The KITTI benchmark contains accurate frame-by-frame ground truth for both rotation and translation, and therefore allows approaches to be compared in detail. Note that, to keep the comparison focused, no further refinement such as bundle adjustment or loop closure is applied to either method. Figure 4.7 shows the output paths of the different approaches; in general, dissecting the scale performs much better than the classical PnP method. For more insight, Figure 4.7 also shows the rotation and translation errors of the different approaches, including the variant using the trinocular reprojection error. As the figure illustrates, scale-decoupled pose estimation reduces the error of the classical PnP approach. Moreover, using the trinocular reprojection error instead of the popular triangulation-and-reprojection error gives a slight further improvement. In fact, the trinocular reprojection error performs slightly better on all sequences we experimented on, which supports its robustness over triangulation and reprojection.

Figure 6.1: A comparison of scale/speed estimation of the PnP-based algorithm (left column) and ours (middle column) against the ground truth. The error distributions are plotted in the right column. Rows correspond to tracks 00 and 07 of KITTI.

Figure 6.2: All tracks we experimented on. Left: average translation error. Middle: average rotation error. Right: average speed error.

Failure modes.
1. The estimation of the essential matrix can go wrong. A typical incorrect estimate of the essential matrix E defines an epipolar geometry as illustrated in Figure 6.3. Note that when the camera is moving forward, as in most of the KITTI dataset, the epipole should lie within the image. However, an incorrect estimate sometimes places the epipole off the side of the image. This is exactly what Figure 6.3 shows: the nearly parallel lines intersect at a potential epipole to the left of the image. Such an incorrect pose estimate usually produces a horizontal jump in the output path. Moreover, it can affect the poses of future frames, as the path is integrated frame by frame.
2. The estimation of the relative pose is based on the assumption that most of the features in two consecutive frames move in the same direction and that this direction reflects the real motion of the camera. This is not always true. Consider the situation in which the car halts at a stop sign while other cars pass by horizontally; this violates the assumption and may lead to the estimate of a lateral motion. See Figure 6.4 for an example, where the truck in the images carries most of the features in its lateral motion.
3. Further work is needed to run the odometry on the BlindFind dataset. One of the BlindFind sequences is captured in the Providence City Mall. Although accurate ground truth is not available as it is for KITTI, the track is known to be a loop that closes on itself at the very end, which provides a way to validate the result. At this point, however, the system fails on this dataset. One reason is that at some point the number of features drops too quickly, which may have to do with image quality or blurring. Figure 6.5 shows an example of feature extraction and matching around the failure area. More investigation is called for to determine what happened and resolve it.
4. The BlindFind dataset has another issue that leads to insufficient features: in turning scenarios the images become blurred, as in Figure 6.6.

Figure 6.3: The epipolar lines and epipole from the ground-truth pose, which intersect at the center of the image, and the nearly parallel lines from the algorithm's estimate.

Figure 6.4: An example of failure mode 2.

Figure 6.5: (a) The features extracted from the last frame. (b) After matching, only very few of them remain.

Figure 6.6: The images are blurred and not enough features could be tracked.

Chapter 7
Conclusions and Future Works

Visual odometry has become one of the most popular topics in computer vision research and applications, as it produces very useful information from very low-cost hardware: cameras. The key to visual odometry is estimating the relative pose between two consecutive frames. The contribution of this thesis over classical methods is untangling the scale estimation from the other five degrees of freedom: rotation and unit translation. The preceding chapters show that this approach achieves better results than classical PnP-based methods when computing a relative pose with information limited to a local neighborhood. Much related work, however, brings the problem to a more global scope, using global bundle adjustment and loop-closure techniques to enhance performance. A potential extension of this thesis is therefore a refinement of the pose estimates using information from a larger neighborhood.
The current approach has the disadvantage that a single incorrect estimate of rotation or translation leads to a drift in the output path from that point on. Although scale-dissected pose estimation is shown to be more accurate, it should be given a second chance to correct itself. Consider three consecutive frames I_i, I_j, and I_k, where the relative poses (R_{ij}, T_{ij}) and (R_{jk}, T_{jk}) have already been computed. Since the relative pose between I_i and I_k, (R_{ik}, T_{ik}), can be derived by composing (R_{ij}, T_{ij}) and (R_{jk}, T_{jk}), an optimization can be carried out within these three frames by tweaking the three relative pose estimates slightly to minimize the reprojection errors. One reason this optimization can be expected to work is that the relative pose estimation from the monocular technique has been shown to be very accurate in most cases, so it is very unlikely to see two incorrect estimates among the three; optimizing over them can therefore be expected to refine the poses. In addition, if common features exist in a larger local neighborhood, more than three frames can be included in this optimization scheme.

Furthermore, since at present only the left images are taken into account when estimating a relative pose with the monocular technique, another way to enhance performance is to include the information from the right images as well. One idea is to minimize the reprojection error in both the left and right images when optimizing the relative pose; even a simple average of the two estimates might reduce the error.

Bibliography

[1] Mark Maimone, Yang Cheng, and Larry Matthies. Two years of visual odometry on the Mars Exploration Rovers. Journal of Field Robotics, 24(3):169-186, 2007.

[2] Gim Hee Lee, Friedrich Fraundorfer, and Marc Pollefeys. Motion estimation for self-driving cars with a generalized camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2746-2753, 2013.

[3] Albert S. Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Robotics Research, pages 235-252. Springer, 2017.

[4] Bernd Kitt, Andreas Geiger, and Henning Lategahn. Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In Intelligent Vehicles Symposium (IV), 2010 IEEE, pages 486-492. IEEE, 2010.

[5] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.

[6] Christian Kerl, Jürgen Sturm, and Daniel Cremers. Robust odometry estimation for RGB-D cameras. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 3748-3754. IEEE, 2013.

[7] David Nistér, Oleg Naroditsky, and James Bergen. Visual odometry. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 1. IEEE, 2004.

[8] David Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756-770, 2004.

[9] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[10] Andreas Geiger, Julius Ziegler, and Christoph Stiller. StereoScan: Dense 3d reconstruction in real-time. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 963-968. IEEE, 2011.
[11] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147-1163, 2015.

[12] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225-234. IEEE, 2007.

[13] Alessandro Chiuso, Paolo Favaro, Hailin Jin, and Stefano Soatto. 3-D motion and structure from 2-D motion causally integrated over time: Implementation. In Computer Vision - ECCV 2000, pages 734-750, 2000.

[14] Igor Cvišić and Ivan Petrović. Stereo odometry based on careful feature selection and tracking. In Mobile Robots (ECMR), 2015 European Conference on, pages 1-6. IEEE, 2015.

[15] Davide Scaramuzza, Friedrich Fraundorfer, and Roland Siegwart. Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, pages 4293-4299. IEEE, 2009.

[16] Davide Scaramuzza, Andrea Censi, and Kostas Daniilidis. Exploiting motion priors in visual odometry for vehicle-mounted cameras with non-holonomic constraints. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 4469-4476. IEEE, 2011.

[17] Menglong Zhu, Srikumar Ramalingam, Yuichi Taguchi, and Tyler Garaas. Monocular visual odometry and dense 3D reconstruction for on-road vehicles. In Computer Vision - ECCV 2012. Workshops and Demonstrations, pages 596-606. Springer, 2012.

[18] Yanhua Jiang, Huiyan Chen, Guangming Xiong, and Davide Scaramuzza. ICP stereo visual odometry for wheeled vehicles based on a 1DOF motion prior. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 585-592. IEEE, 2014.

[19] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155-166, 2009.

[20] Bhoram Lee, Kostas Daniilidis, and Daniel D. Lee. Online self-supervised monocular visual odometry for ground vehicles. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 5232-5238. IEEE, 2015.

[21] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354-3361. IEEE, 2012.