
Line segment tracking for 3D reconstruction

Report submitted for the third year of the Diplôme d'Ingénieur en Informatique et Communication
Laurent Guichard
August 2001

Supervisors: Nicolas Dano (GE CRD), Luce Morin (IRISA)

Acknowledgements

First, I wish to thank Nicolas Dano, who made it possible for me to join the GE Research & Development Center in the United States and has been an attentive supervisor. I am grateful to Geoff Cross and Peter Tu, who have been really patient and kind with me. I also thank Don Hamilton, Joe Mundy, and all the members of the Image Understanding group for their patience, help and advice. They have all been kind to me at work and also outside of work, where they helped me settle into this huge country. Finally, I would like to thank Patricia Bourdel, Mael Dano and all the people of General Electric for their kindness.


Abstract

3D reconstruction from images (or videos) has been an intensive research topic for the last decade. Recently, the problem has been solved in theory for ideal cases; however, practical implementations are often unstable. GE CRD implemented a scene reconstruction application that uses 2D points tracked in the sequence to compute camera parameters and 3D point locations, and then build a virtual model. This report describes the integration of line segments into the reconstruction process. We present our algorithm for tracking line segments in a video sequence and the methods used to reconstruct those segments in 3D. The interest of line segment reconstruction is highlighted by using the segments in practical circumstances such as meshing and dense matching.

Key words: 3D scene reconstruction, video, structure from motion, lines.

Résumé

3D reconstruction from images (or from a video sequence) has been an intensive research topic over the last ten years. Recently, the problem has been solved in theory for certain ideal cases, but the implementations have proven unstable. The GE research center has developed a scene reconstruction application that uses points detected and tracked in a video sequence to compute the cameras and the positions of the 3D points, and finally to build a 3D model. This report describes the integration of segments into the reconstruction process. We present our segment tracking algorithm for video sequences and the methods used to reconstruct these segments in three dimensions. The interest of segment reconstruction is demonstrated by its use in two practical cases: meshing and dense matching.


Table of contents

I. Introduction
II. The Company
   A. Overview of General Electric
   B. The Visualization and Computer Vision Group at the Research and Development Center
III. Project context and background
   A. VXL
   B. Background Geometry
      1. Notation
      2. Camera model
      3. Epipolar geometry
IV. Feature Tracking
   A. Points
      1. Kanade-Lucas algorithm
      2. Points weeding
   B. Segments
      1. Edge detection: Canny Edge Detector
      2. Segment fitting
      3. Segment matching
V. 3D Reconstruction
   A. Projective reconstruction
      1. The trifocal tensor
      2. The camera matrices
      3. The 3D points
      4. The 3D segments
      5. The reconstruction ambiguity
   B. Affine reconstruction
      1. Quasi-affine reconstruction
      2. Affine reconstruction
   C. Euclidean Reconstruction
   D. Meshing
   E. Dense matching
      1. Image rectification
      2. Disparity map
      3. Segment constraints
VI. Conclusion and Future Work


I. Introduction

The first stage of the reconstruction of a scene from a video sequence is to find a relation between images. This is accomplished by detecting and tracking features in the image sequence. The most general approach is to use 2D points tracked in the images. As the man-made world contains objects whose boundaries are defined by lines, using segments in the reconstruction process takes the results a step further. The first part of this report presents feature tracking in an image sequence: points, as is usually done, with the Kanade-Lucas algorithm, and line segments with our novel approach. The second part describes the use of the tracked features in two kinds of reconstruction: meshing and dense matching.


II. The Company

A. Overview of General Electric

General Electric (GE) is the result of an 1892 merger of the Edison General Electric Company (the Edison Electric Light Company established by Thomas A. Edison in 1878) and the Thomson-Houston Electric Company. It used to manufacture electrical motors and products related to electricity. For example, in 1895, GE built the world's largest electrical locomotives (90 tons) and transformers (800 kW). In 1919, GE developed a turbine-driven supercharger enabling the LePere biplane to set a record of 137 MPH at an altitude of 18,400 feet. And in 1922, WGY, GE's radio station and the first American radio station, went on the air in Schenectady, NY. Six years later, WGY broadcast the first television drama using a technique developed by Ernest F.W. Alexanderson. Today, GE is no longer a simple "electric" company. It is a diversified technology, manufacturing and services company, a leader in most of its twelve businesses: electrical motors, locomotive and rail equipment, plane engines, medical equipment, electrical turbines, plastics, lighting, industrial control systems, appliances, capital services, information systems, and television. GE operates in more than 100 countries around the world, including 250 manufacturing plants in 26 different nations, and employs over 200,000 people worldwide, including 165,000 in the United States. Twice as big as Elf, the leading French company, GE is the leading American company and is ranked second worldwide behind Shell. It is worth over 90 billion dollars divided between its businesses.

B. The Visualization and Computer Vision Group at the Research and Development Center

The GE Research & Development Center was created in 1965 by the merger of the GE Research Laboratory (founded in 1900) and the GE Advanced Technology Laboratory (established in 1900). It is one of the world's largest and most diversified industrial laboratories. Twelve laboratories support the efforts of research groups around the world associated with GE's twelve global businesses: the Ceramics Lab, the Characterization and Environmental Technology Lab, the Chemical Process Technology Lab, the Control Systems and Electronic Technology Lab, the Electronic Systems Lab, the Engineering Mechanics Lab, the Industrial Electronics Lab, the Information Technology Lab, the Manufacturing and Business Process Lab, the Mechanical Systems Lab, the Physical Metallurgy Lab, and the Polymer Materials Lab. The Visualization and Computer Vision Group (VCV Group) is one of the six groups which compose the Electronic Systems Laboratory. The VCV projects are related to pattern recognition, artificial vision and database management.


III. Project context and background

A. VXL

The VCV Group, Oxford University and Leuven University are developing a new, large, open-source image understanding environment based on Target Junior. It is designed both as a computer vision research tool and as a platform for building end-user applications. It is called VXL, for Vision Everything Library. VXL is composed of several packages. The ones most used for this project are:
• VCL, for C++, handles the STL libraries and the differences between compilers.
• VNL, for Numerics, contains mathematical entities and functionality, such as complex numbers, matrices and optimization algorithms.
• MVL contains objects and algorithms for computer vision, such as fundamental matrix and epipole computation.
• OSL, for segmentation, contains the Canny-Oxford and line fitting algorithms.
• VIL, for Image, handles the algorithms to load, store, manipulate and save images.
• VGUI, for graphical user interface, offers methods to display and manipulate images.

B. Background Geometry

We assume here that the reader has some knowledge of computer vision. We just recall the main definitions and relations.

1. Notation

This report employs a standard notation, which is consistent throughout the work:
• Vectors are denoted by lower-case bold symbols (e.g. v) and matrices by typeface symbols (e.g. M);
• 3D points are denoted by upper-case bold symbols (e.g. X);
• 2D points and image points are denoted by lower-case bold symbols (e.g. x).

We introduce new notations:
• Matches between a feature in one image and a feature in another image are denoted by two 2D points between angle brackets (e.g. ⟨x, x′⟩). By extension, matches in n images are denoted ⟨x, x′, …, x^(n−1)⟩.
• Segments are denoted by two 2D points between square brackets (e.g. [x_f, x_s]).

2. Camera model

The most general linear camera model is known as a central projection camera (pinhole camera). A 3D point in space is projected onto the image plane by means of straight visual rays, all passing through the focal point, or camera center. The corresponding image point is given by the intersection of the image plane and the visual ray. So, a 3D point (X, Y, Z) is projected onto the image point (x, y) by means of the homogeneous equation:


$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
where the equality holds only up to a scale, and the 3×4 projection matrix P represents the camera parameters.
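As an illustration, here is a minimal sketch of this projection in Python with NumPy (the camera matrix values are made up for the example):

```python
import numpy as np

# A hypothetical 3x4 projection matrix P (identity rotation, unit focal length).
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

X = np.array([2.0, 3.0, 4.0, 1.0])   # homogeneous 3D point (X, Y, Z, 1)

x = P @ X                            # homogeneous 2D point, defined up to scale
x = x / x[2]                         # normalize so that the last coordinate is 1
print(x[:2])                         # image coordinates (x, y) = (0.5, 0.75)
```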

3. Epipolar geometry

The epipolar geometry refers to the projective geometry linking two views. Here, we briefly summarize the standard results. The reader can find more details in [2].

Epipolar transfer
Consider two cameras viewing a single 3D point, X.

Figure III-1: Point correspondence geometry (the epipolar plane through the camera centers C and C′ and the point X intersects the two image planes in the epipolar lines l and l′, which contain the image points x and x′ and pass through the epipoles e and e′).

Figure III-1 shows that there is a plane, called the epipolar plane, which passes through the two camera centers, C and C′, and through X. This plane intersects the image planes in a pair of corresponding epipolar lines, l and l′. The images of X, x and x′, must lie on the two epipolar lines l and l′ respectively. To conclude, if we know the epipolar geometry corresponding to two given cameras, we can deduce on which epipolar line l′ the point x′ corresponding to an image point x of the first image must lie.

Fundamental matrix
The relation between two images of a pair of corresponding 2D points is best expressed by the fundamental matrix equation: x′^T F x = 0, where F is a 3×3 matrix.


IV. Feature Tracking

When trying to reconstruct a scene from an image sequence, we need to compute relations between the views. This means that we must know the changes that have been applied to the camera between images I and I+1. To do so, we have to link image I and image I+1 by looking at their similarities and their differences. If we can track specific regions from image I to image I+1, we are able to compute the parameters of the cameras. The feature that is most commonly used (i.e. tracked) is the point, so we present the way it is done in the first part of this chapter. Those algorithms existed in Target Jr and we had to understand them and implement them in VXL. As we said in the introduction, we wanted to use a new kind of feature to improve the reconstruction. Thus, in the second part of this chapter, we introduce segments as features and the algorithms especially created to achieve their tracking in the sequence.

A. Points

The reason we use points is that they are easy to detect and to represent mathematically, and that a point in an image corresponds directly to a three-dimensional entity in space. To track the points, we used a well-known algorithm called Kanade-Lucas.

1. Kanade-Lucas algorithm

We can represent a 2D point in image t by (x, y, t). The aim of the algorithm is to find the new location of all the points (x, y, t) in the next image (t + 1). We can write:
∀(x, y, t), I(x, y, t + 1) = I(x − dx, y − dy, t)
The Kanade-Lucas algorithm does not track all the pixels of the image. It first extracts the most relevant points in the first image of the sequence, then tries to find their matches in the second image, and so on for the rest of the images. A point (x, y, t) is matched with a point (x′, y′, t + 1) by looking for the best correlation between the window around the point (x, y, t) and the window around the point (x′, y′, t + 1). So, for a number of points (x, y) that have been tracked in the sequence, we get displacement vectors between the images t and t + 1, for all t. We downloaded an implementation of the Kanade-Lucas algorithm from the internet and created an interface with our own code. More information and details can be found in [01]. Below are the results we obtain with the Kanade-Lucas algorithm for only 2 images. The parameters of the search were set to look for 1000 points.


Image 0

Image 1

The blue segments represent the displacement vectors.
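For reference, a minimal sketch of this tracking step, assuming OpenCV's pyramidal Lucas-Kanade implementation as a stand-in for the downloaded implementation the report interfaced with VXL (file names and parameter values are illustrative):

```python
import cv2
import numpy as np

# Load two consecutive grayscale frames (hypothetical file names).
img0 = cv2.imread("image0.png", cv2.IMREAD_GRAYSCALE)
img1 = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)

# Extract up to 1000 relevant corner points in the first image.
pts0 = cv2.goodFeaturesToTrack(img0, maxCorners=1000,
                               qualityLevel=0.01, minDistance=7)

# Track them into the second image with pyramidal Lucas-Kanade.
pts1, status, err = cv2.calcOpticalFlowPyrLK(img0, img1, pts0, None,
                                             winSize=(15, 15), maxLevel=3)

# Keep only the points that were successfully tracked; the difference
# pts1 - pts0 gives the displacement vectors drawn in the figures.
good0 = pts0[status.ravel() == 1].reshape(-1, 2)
good1 = pts1[status.ravel() == 1].reshape(-1, 2)
displacements = good1 - good0
```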

2. Points weeding

The output of the Kanade-Lucas algorithm contains a lot of outliers. Thus, we worked on weeding methods to keep only the good matches. From image 0 above, we have zoomed in on a small region. When we look closely at the displacement vectors, we can see that on average they all have the same direction and length. This means that there is a general movement in the image. But for the points marked with an arrow, the displacement vector is completely different. We can conclude that the Kanade-Lucas algorithm has failed to match those points correctly.

We have to remove those wrong matches by weeding. To do so, we must keep in mind that most of the points are well matched. This means that we have many inliers (i.e. true matches) and few outliers (i.e. wrong matches). We just have to find a way of distinguishing the inliers from the outliers: if we have a constraint that is respected only by the inliers, we will be able to weed out the outliers. We used two different kinds of constraint.

The homography constraint
A homography is the transformation of a 3D plane from one image to the other. We can consider the scene as mainly planar and allow a parallax effect which corresponds to the relief. All the pairs of 2D matches corresponding to 3D points which are too far from this plane will not respect the homography constraint. This transformation roughly transforms all the points from one image to the other. Let us take x, a 2D point in image t, and x′, its match in image t + 1. We want to find a transformation that can be written in homogeneous coordinates:


$$\forall \langle x, x' \rangle,\quad x' = Hx
\qquad\text{with}\qquad
H = \begin{pmatrix} h_{0,0} & h_{0,1} & h_{0,2} \\ h_{1,0} & h_{1,1} & h_{1,2} \\ h_{2,0} & h_{2,1} & h_{2,2} \end{pmatrix},
\quad x^T = (x, y, w),\quad x'^T = (x', y', w')$$

Each match ⟨x_i, x_i′⟩ gives us three equations:

$$\begin{cases}
x_i' = x_i h_{0,0} + y_i h_{0,1} + w_i h_{0,2} \\
y_i' = x_i h_{1,0} + y_i h_{1,1} + w_i h_{1,2} \\
w_i' = x_i h_{2,0} + y_i h_{2,1} + w_i h_{2,2}
\end{cases}$$

With n matches, we have a system of 2n equations. To solve this system, we prefer to write it this way:

$$h^T = [\,h_{0,0}\; h_{0,1}\; h_{0,2}\; h_{1,0}\; h_{1,1}\; h_{1,2}\; h_{2,0}\; h_{2,1}\; h_{2,2}\,]$$

$$A = \begin{pmatrix}
x_0 w_0' & y_0 w_0' & w_0 w_0' & 0 & 0 & 0 & -x_0 x_0' & -y_0 x_0' & -w_0 x_0' \\
0 & 0 & 0 & x_0 w_0' & y_0 w_0' & w_0 w_0' & -x_0 y_0' & -y_0 y_0' & -w_0 y_0' \\
\vdots & & & & & & & & \vdots \\
x_i w_i' & y_i w_i' & w_i w_i' & 0 & 0 & 0 & -x_i x_i' & -y_i x_i' & -w_i x_i' \\
0 & 0 & 0 & x_i w_i' & y_i w_i' & w_i w_i' & -x_i y_i' & -y_i y_i' & -w_i y_i' \\
\vdots & & & & & & & & \vdots \\
x_{n-1} w_{n-1}' & y_{n-1} w_{n-1}' & w_{n-1} w_{n-1}' & 0 & 0 & 0 & -x_{n-1} x_{n-1}' & -y_{n-1} x_{n-1}' & -w_{n-1} x_{n-1}' \\
0 & 0 & 0 & x_{n-1} w_{n-1}' & y_{n-1} w_{n-1}' & w_{n-1} w_{n-1}' & -x_{n-1} y_{n-1}' & -y_{n-1} y_{n-1}' & -w_{n-1} y_{n-1}'
\end{pmatrix}$$

The system to solve is now A·h = 0. The vector h has 9 entries but only 8 independent unknowns, since it is defined up to scale, so:
• If n < 4 then the system is under-determined.
• If n = 4 then there is one solution (except if A is not full-rank).
• If n > 4 then generally there is no exact solution to the over-determined system. In this case, we minimize the norm ||Ah|| with the constraint ||h|| = 1 to avoid the solution h = 0. There are many non-linear algorithms to minimize this norm based on different cost functions. The method used in our case is a linear one and needs only the computation of the SVD of the matrix A (cf. annex A).

Now, suppose we have computed our homography matrix only on true matches. To accept or reject a match, we must allow a tolerance ε, due to the errors of measurement of the point positions and to the relief of the scene. So, we form the difference d = ||x′ − Hx||² and compare it to the threshold ε.
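A minimal sketch of this linear solution in Python/NumPy, assuming inhomogeneous pixel coordinates with w = w′ = 1 (this is not the VXL code actually used; the helper names are ours):

```python
import numpy as np

def estimate_homography(pts0, pts1):
    """DLT estimate of H such that pts1 ~ H @ pts0 (both arrays are N x 2)."""
    rows = []
    for (x, y), (xp, yp) in zip(pts0, pts1):
        # Two equations per match, with w = w' = 1.
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp, -xp])
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp, -yp])
    A = np.array(rows)
    # Minimize ||A h|| under ||h|| = 1: h is the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)

def transfer_error(H, x0, x1):
    """Squared distance d = ||x' - Hx||^2 used to accept or reject a match."""
    p = H @ np.array([x0[0], x0[1], 1.0])
    p = p[:2] / p[2]
    return float(np.sum((np.asarray(x1) - p) ** 2))
```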


If d < ε then the match ⟨x, x′⟩ belongs to the inliers; otherwise, it is an outlier. The value of the threshold is found empirically and may have to be adjusted for a new sequence of images. The problem of this method is that it does not take into account the fact that the recorded scene is three-dimensional.

Figure IV-1: A homography is a 2D transformation (a point x in the first image is mapped by H to a point x_t′ in the second image; x_t′ coincides with the true match x′ only if the 3D point X lies on the plane π).

The constraint x′ = Hx will be respected exactly only for 3D points lying on the plane π. Otherwise, the 2D point x_t′, computed with the equation above, will be on the correct epipolar line but at the wrong place (here, Hx = x_t′). This is a major limitation of the homography constraint. It works well for almost planar scenes but is difficult to apply to general scenes: first, because we have to guess the empirical threshold, and second, because for scenes with a high relief we are obliged to take a bigger threshold, which is unfortunately tolerant of a higher noise.

The fundamental constraint
This constraint comes from the fundamental matrix equation: x′^T F x = 0. Knowing the fundamental matrix, all the matches ⟨x, x′⟩ must satisfy this constraint. This relation gives the equations of two epipolar lines:
l = F^T x′ (the epipolar line in the first image) and l′ = F x (the epipolar line in the second image).

Figure IV-2: The orthogonal distance between an epipolar line and a point (d is the distance from x to the epipolar line l in image 0, and d′ the distance from x′ to the epipolar line l′ in image 1).


The distances d(x, l) and d(x′, l′) are called d and d′ respectively. We use those distances to determine whether a match is correct. The ideal case would be d = 0 and d′ = 0. As the two images may be noisy because of the imperfection of the imaging devices, and as the Kanade-Lucas algorithm may match points that do not correspond exactly, we have to tolerate a small error. Thus, we introduce a threshold ε. The relationship is:
d + d′ < ε
A match ⟨x, x′⟩ that satisfies this inequality belongs to the inliers. The advantage of using this constraint instead of the homography constraint is that it considers the depth of the recorded scene. The computation of the fundamental matrix is done the same way as for the homography matrix. We do not give the details of the algorithm here because we did not have to implement it; suffice it to say that it requires at least 7 true matches to be computed.

Robust estimation
In both cases, we have supposed that the matrix (homography or fundamental) was known and we used it to test whether a match was false or true. But in reality we do not have this matrix a priori, and we cannot compute it with the matches we get from Kanade-Lucas, because some of them are false, which would disturb the computation of the estimated matrix. This kind of situation is well suited to a robust estimation method called RANSAC. The goal of this method is to determine the inliers so that the matrix (homography or fundamental) can be estimated only from those inliers, using the algorithms described above. Here is the RANSAC algorithm:
• Repeat k times:
  1. Randomly select a subset of n matches from the set of N matches,
  2. Compute the homography matrix or fundamental matrix with this subset,
  3. For each match ⟨x_i, x_i′⟩, compute the distance:
     Homography: D_i = ||x_i′ − Hx_i||²
     Fundamental matrix: D_i = d_i² + d_i′²
  4. Compute the number of inliers: ⟨x_i, x_i′⟩ is an inlier if D_i < ε,
  5. Store the number of inliers.
• The best H or F is the one which leads to the largest number of inliers.
• Re-estimate H or F with all its inliers.

For the homography matrix, the size of the subset is 4. For the fundamental matrix, the size of the subset is 7 or 8 (the computation is easier with 8 matches). This method is quite robust and is able to find a solution even when more than 50% of the data are outliers. This is a main advantage over other methods such as Least Median of Squares.
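A compact sketch of this RANSAC loop for the homography case, reusing the estimate_homography and transfer_error helpers sketched earlier (the iteration count and threshold are illustrative, not the values used in the report; pts0 and pts1 are N x 2 arrays of matched points):

```python
import numpy as np

def ransac_homography(pts0, pts1, n_iters=500, eps=4.0, rng=None):
    """Return the homography with the largest inlier set and the inlier mask."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(pts0)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(n, size=4, replace=False)      # minimal subset
        H = estimate_homography(pts0[sample], pts1[sample])
        d = np.array([transfer_error(H, p, q) for p, q in zip(pts0, pts1)])
        inliers = d < eps
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-estimate H with all the inliers of the best model.
    H = estimate_homography(pts0[best_inliers], pts1[best_inliers])
    return H, best_inliers
```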

Results Below are the results of this method. For the homography constraint:

Many points that have been weeded were true matches. The remaining matches are located around a principal plane. All the points belonging to this plane satisfy the homography constraint x ′ = Hx (without considering the noise due to video capture). For the fundamental constraint:

The number of matches is greater than in the case of the homography constraint. And it is not just a matter of threshold: it is really due to the fact that the fundamental matrix takes into account the depth of the scene.

B. Segments

Tracking segments in an image sequence consists of three stages:
• Edge detection,
• Segment fitting,
• Segment matching.

1. Edge detection: Canny Edge Detector

To detect the edges in the images, we use a widespread algorithm called Canny. It does not deal with video sequences, so we had to run it on each image separately. More information and details can be found in [1]. Here, we explain its main steps:
• Convolve the image with a Gaussian filter to smooth it, so that noise due to video capture does not disturb the next step.
• Convolve the smoothed image with the 2D first-derivative approximation of a Gaussian filter to highlight the details of the image.
• Run a non-maximum suppression on the gradient magnitude obtained from the previous step in order to thin the edges.
• Perform a hysteresis thresholding with two thresholds T1 and T2 (T1 > T2). If a pixel is above T1, it is kept. If a pixel is between T1 and T2 and "can be connected to a pixel above the higher threshold directly, or through a path consisted of pixels all above the lower threshold", then this pixel is kept. If a pixel is below T2, the pixel is rejected.

For each image, we perform those steps. The three main parameters we had to adjust are:
• The width of the smoothing Gaussian kernel: σ
• The lower and the upper hysteresis thresholds: T2 and T1

We have obtained the following results:

Image 0

Image 1

The yellow strokes correspond to the detected edges. The parameters: σ = 1 pixel, T2 = 2, T1 = 12.
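For illustration, a minimal sketch of this edge detection step with the same parameter values, assuming OpenCV's Canny as a stand-in for the Canny-Oxford code in OSL (OpenCV's Canny does not smooth internally, so the Gaussian smoothing is applied explicitly):

```python
import cv2

img = cv2.imread("image0.png", cv2.IMREAD_GRAYSCALE)      # one frame of the sequence
smoothed = cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=1.0) # sigma = 1 pixel
edges = cv2.Canny(smoothed, threshold1=2, threshold2=12)   # hysteresis T2 = 2, T1 = 12
```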


2. Segment fitting

Now, we have a set of edges for each image. We will approximate each edge by one or more segments. Consider an edge. We arbitrarily start from one endpoint a and follow each point of the edge towards the other endpoint b. All the points of the edge are denoted x_i, so the edge can be represented by all its points:
a = x_0 → x_1 → … → x_i → … → x_j → … → x_{n−1} = b

Below is the segment fitting algorithm:

i = 0
For j = i to n − 1 do
    Form the matrix A whose rows are (x_k, y_k, 1) for k = i, …, j, with the points of the edge
    Find the line l = (l_x, l_y, l_w)^T that best approximates the points by solving Al = 0
    If RMS(l, x_i, …, x_j) > ε Then
        Create_Segment(l_best, x_i, x_{j−1})
        i = j
    Else
        l_best = l
    EndIf
EndFor

The function RMS (Root Mean Square) returns
RMS(l, x_i, …, x_j) = sqrt( (1 / (j − i + 1)) · Σ_{k=i}^{j} d(l, x_k)² ).
The method Create_Segment(l_best, x_i, x_{j−1}) creates a segment by projecting the points x_i and x_{j−1} onto the line l_best; those two projected points form the endpoints of the segment.

The images below show the segments fitted (in red) by the segment fitting algorithm from the Canny edge detector output (in yellow), as implemented in VXL.
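A minimal sketch of the inner line fit (the homogeneous solve Al = 0 and the RMS test), assuming NumPy; the scan over j follows the pseudocode above:

```python
import numpy as np

def fit_line(points):
    """Line l = (lx, ly, lw) minimizing ||Al|| with ||l|| = 1, as in the pseudocode."""
    pts = np.asarray(points, dtype=float)
    A = np.hstack([pts, np.ones((len(pts), 1))])   # rows (x_k, y_k, 1)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                                  # right singular vector of the smallest singular value

def rms_distance(l, points):
    """RMS of the point-to-line distances d(l, x_k)."""
    pts = np.asarray(points, dtype=float)
    d = (pts @ l[:2] + l[2]) / np.linalg.norm(l[:2])
    return float(np.sqrt(np.mean(d ** 2)))
```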


3. Segment matching

a) Problem Statement
At this point, we have to match the segments between the images. To facilitate the reader's comprehension, we will deal with only two images (image 0, the first image, and image 1, the second image). Consider a segment in image 0. To find its corresponding segment, we can compare it to all segments in the second image and decide that the best match is the one that minimizes a certain criterion (or cost function). This is the exhaustive method, which can be cost prohibitive if we have, for example, 200 segments in each image: it would lead to computing the criterion 200 × 200 = 40000 times. Moreover, the point matching algorithm gives us a relationship between those two images (the homography matrix or the fundamental matrix), a relationship that we can use to decrease the number of comparisons between segments. Indeed, knowing the homography matrix between image 0 and image 1, we can have a rough idea of where the corresponding segment should be in the second image if we know its location in the first image. We transform the current segment in image 0 by this matrix and obtain a new segment in image 1. Let us have a segment s in image 0 and its correspondent s′ in image 1. As we have the endpoints of the segment s, denoted [x_f, x_s], we transform them by H so that we get a new segment s^t = [H·x_f, H·x_s] = [x_f^t, x_s^t] in image 1, called the transformed segment. The transformed segment s^t in image 1 does not have exactly the same position and orientation as the segment s′, because the homography matrix is a 2D transformation which, as we said, gives only the approximate position of the points. Our goal now is to match the segments in image 0 (segments s) with the segments in image 1 (segments s′) using the transformed segments (segments s^t).

b) Preliminary weeding
First, we define two criteria to keep only the most probable matches for a segment s:
• If the angle between the segments s^t and s′ is above a threshold, the segment s′ is removed from the possible matches for the segment s.
• If the endpoints of the segment s′ do not belong to a window defined around the segment s^t, and the segment s′ does not cross this window, it is removed too.


Figure IV-3: The segment s_0 of image 0 is transformed by the homography H to a segment s_0^t in image 1. Around this segment, we define a window (in dashed red); all the segments of image 1 that fall inside this window are accepted (i.e. s_0′ and s_2′), the others are rejected (i.e. s_1′). This is a rough but efficient weeding.

At this point, for most of the segments s in image 0, we have found a subset of segments s′ in image 1 that are possible matches. We need to define another criterion to make a choice between the remaining candidates. Here are the two methods that we have tried; the results of those methods are given further on.

c) Weeding: the geometric distance criterion
We compute the orthogonal distance and the overlap distance between the current segment s^t (transformed from the current segment s) and all the remaining segments s′. The match ⟨s, s′⟩ that minimizes the quotient (orthogonal distance) / (overlap distance) is the best one. When trying to compute those distances, we encounter different cases depending on the segments involved.

(Figure: the four cases: s′ totally projected onto s^t, s^t totally projected onto s′, and the two cases where the projection is only partial, each with the orthogonal distances d_ortho^1, d_ortho^2 and the overlap distances d_over^1, d_over^2.)

The overlap distance corresponds to d_over = d_over^1 + d_over^2. The orthogonal distance corresponds to d_ortho = d_ortho^1 + d_ortho^2. The best match between the segment s and the segments s′ is the one that minimizes the orthogonal distance and maximizes the overlap distance. So, by computing d_ortho / d_over and looking for the smallest quotient, we ensure that those two constraints are respected. It also avoids us having to use weights to adjust the respective importance of each distance.
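A sketch of one way to compute these two distances for a pair of segments, assuming each segment is given as a pair of 2D endpoints (this is an interpretation of the figure, not the VXL implementation, and it only handles the candidate projected onto the transformed segment):

```python
import numpy as np

def ortho_and_overlap(seg_t, seg_p):
    """Orthogonal and overlap distances between a transformed segment seg_t
    and a candidate segment seg_p, each given as a pair of 2D endpoints."""
    p0, p1 = map(np.asarray, seg_t)
    q0, q1 = map(np.asarray, seg_p)
    u = p1 - p0
    length = np.linalg.norm(u)
    u = u / length
    n = np.array([-u[1], u[0]])              # unit normal to seg_t

    # Orthogonal distances of the candidate's endpoints to the line of seg_t.
    d_ortho = abs(np.dot(q0 - p0, n)) + abs(np.dot(q1 - p0, n))

    # Overlap: common part of seg_t and the projection of seg_p onto its line.
    t0, t1 = sorted((np.dot(q0 - p0, u), np.dot(q1 - p0, u)))
    d_over = max(0.0, min(t1, length) - max(t0, 0.0))
    return d_ortho, d_over

# The best candidate minimizes d_ortho / d_over (a zero overlap rules a pair out).
```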

d) Weeding: the correlation criterion
The previous method does not always work, especially in scenes with high relief where the homography is a bad approximation. Thus, we propose this correlation method, which is based on image information instead of geometric information. We are going to compute a correlation cost between each pair of segments ⟨s, s′⟩, and the match that minimizes this correlation cost will be kept as the best match. This method is based on the same idea as the Kanade-Lucas algorithm.

Reorder end-points
Consider a segment s and one of its possible matches s′. To be able to compare them, we must first make sure that their endpoints correspond.

Figure IV-4: Sorting of the endpoints (before matching, the endpoints x_f, x_s of s and x_f′, x_s′ of s′ may be in opposite order; after matching, they correspond).

To do so, we just apply this small algorithm:
• From s = [x_f, x_s] and s′ = [x_f′, x_s′],
• compute the dot product d of the vector v_s going from x_f to x_s and the vector v_s′ going from x_f′ to x_s′;
• if d < 0, swap the endpoints of s′, i.e. set s′ = [x_s′, x_f′]; otherwise, they are already well sorted (i.e. matched).

Correlation of regions of interest
Now, we have the problem that the segments s and s′ do not have the same length, because the Canny-Oxford algorithm and the line fitting algorithm have no notion of an image sequence.


If we want to compute the correlation between those two segments, we need to define the same area (a rectangle in our case) around the segments s and s′.

(Figure: correlation windows around s in image 0 and s′ in image 1; the widths l and l′ are equal, but the lengths L and L′ are not.)

As we can see from the figure, the widths l and l′ are the same because they are defined by the user as a parameter. But the lengths L and L′ are not the same because they depend on the lengths of the segments. Here, we use the homography matrix H to compute the segment s^t, the transformed segment of s, and the inverse of the homography matrix H⁻¹ to compute the segment s′^t, the transformed segment of s′.

(Figure: in each image, U denotes the union and I the intersection of the original segment with the projection of the transformed segment coming from the other image; the red points are the endpoints of the projected segments s^{t→} and s′^{t→}.)

Then, we project s^t onto the line defined by the segment s′, and s′^t onto the line defined by the segment s. The projected segments are denoted s^{t→} and s′^{t→}. We have the choice of keeping either the union or the intersection of the segments. In practice, we use the intersection, because we are sure that this part of the segment is an edge; using the union would mean stretching the segments and assuming that the new parts of the segments also correspond to edges. The segments resulting from the intersection are called the intersection segments and are denoted s_r in image 0 and s_r′ in image 1. They have the same length, which was our goal. We call "refining" this process of adjusting the lengths of the segments.

Simple correlation
Now, we can apply our correlation to those two segments. The method we use:




• Change the reference frame from the image reference frame to the window reference frame for both images. We use a translation and a rotation T_0 to change the reference frame in image 0; the same kind of transformation T_1 is applied to image 1.

(Figure: the window reference frame (x_w, y_w) is aligned with the intersection segment s_r in image 0, and (x_w′, y_w′) with s_r′ in image 1.)

• Divide the windows into small squares and compute the correlation between the two windows with the formula:
S = Σ_{i_w=0}^{L−1} Σ_{j_w=0}^{l−1} | W(i_w, j_w) − W′(i_w, j_w) |

We divide S by L² so that the match of segment s with a long segment s′ is preferred to the match with a short one; indeed, we want to match the longest segments. To get the values of W(i_w, j_w) and W′(i_w, j_w), we need to perform a bilinear interpolation of the neighbouring pixels of (i_o, j_o) = T_0⁻¹·(i_w, j_w). We pose di_o = i_o − E(i_o) and dj_o = j_o − E(j_o), where E denotes the integer part. Thus:
W(i_o, j_o) = di_o·dj_o·I_0(E(i_o), E(j_o)) + (1 − di_o)·dj_o·I_0(E(i_o)+1, E(j_o)) + di_o·(1 − dj_o)·I_0(E(i_o), E(j_o)+1) + (1 − di_o)·(1 − dj_o)·I_0(E(i_o)+1, E(j_o)+1)
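A sketch of this correlation cost using SciPy's map_coordinates for the bilinear sampling of the windows. The arguments T0_inv and T1_inv are assumed to be 3×3 homogeneous versions of the inverse window-to-image transforms T_0⁻¹ and T_1⁻¹ (that representation is our assumption, not the report's):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def window_cost(img0, img1, T0_inv, T1_inv, L, l):
    """Correlation cost S / L^2 between two L x l windows, one per image."""
    iw, jw = np.meshgrid(np.arange(L), np.arange(l), indexing="ij")
    coords = np.stack([iw.ravel(), jw.ravel(), np.ones(iw.size)])   # homogeneous (i_w, j_w, 1)

    rc0 = (T0_inv @ coords)[:2]      # pixel positions (i_o, j_o) in image 0
    rc1 = (T1_inv @ coords)[:2]      # pixel positions in image 1

    # order=1 performs the bilinear interpolation of the neighbouring pixels.
    W0 = map_coordinates(img0.astype(float), rc0, order=1)
    W1 = map_coordinates(img1.astype(float), rc1, order=1)

    S = np.abs(W0 - W1).sum()        # sum of absolute window differences
    return S / L**2                  # favor long segments, as in the text
```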

Elaborate correlation
At this point, we had a complete method to match segments, but we decided that it was possible to improve it after identifying a problem due to the homography limitations. Indeed, we know that the homography gives only an approximation of the positions of the segments s′ in image 1. This error may disturb the matching in some cases. Below is an example of the kind of problem that can occur:

(Figure: the segment s of image 0, transformed by H into s^t, lands in image 1 shifted along its own direction with respect to the true segment s′.)


We see that the homography transformation has moved the segment s^t along the direction of the segment s′, and that this slide leads to a poor correlation (the red pixels in image 0 correspond to the black pixels in image 1). To solve this problem, the only solution is to look for the best correlation along the direction of the segment s or s′. So, we introduce a translation vector t_w = (t_w^x, t_w^y)^T that has the direction of segment s and whose norm varies between two boundaries defined by the user as a parameter.

(1) First part of the search for the minimum of correlation
The best (i.e. smallest) correlation is found this way:

t_w^x = 0
For t_w^y = −boundary to boundary, step +1 do
    t = T_0⁻¹ · t_w
    C = Correlation(s + t, s′)
    If C < C_min then
        C_min = C
        t_min = t
    EndIf
EndFor

We do not get sub-pixel accuracy here because the translation vector takes only discrete values. This means that we have not found the minimal correlation, but we know that with this translation we are not far from it.

(2) Second part of the search for the minimum of correlation
So, we need to look around the solution given by the translation for the best possible sub-pixel solution. For that, we use the Levenberg-Marquardt minimization algorithm. It takes as input the coordinates of the translation vector t we found, and it modifies the coordinates of t to minimize the correlation.

(Figure: the correlation cost C as a function of the vector t; the square points represent the discrete values of t used in (1), and the minimum found after (2) is slightly lower than the minimum found after (1).)

Now, we have corrected the error introduced by the homography. We have computed an elaborate correlation criterion which will be used to distinguish which segment in image i+1 is the best matching solution for a given segment in image i.
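A sketch of the sub-pixel refinement step (2), using SciPy's Levenberg-Marquardt solver. Here the per-pixel window differences are treated as the residual vector (the report minimizes a sum of absolute differences, so this least-squares formulation is an approximation), and sample_window is an assumed helper that bilinearly samples the correlation window of a segment shifted by t:

```python
import numpy as np
from scipy.optimize import least_squares

def refine_translation(t0, img0, img1, segment, segment_prime, sample_window):
    """Refine the discrete translation t0 = (tx, ty) to sub-pixel accuracy."""
    ref = sample_window(img1, segment_prime, np.zeros(2))   # fixed window around s'

    def residuals(t):
        # Difference between the window around s shifted by t and the window around s'.
        return (sample_window(img0, segment, t) - ref).ravel()

    result = least_squares(residuals, x0=np.asarray(t0, dtype=float), method="lm")
    return result.x            # refined translation vector t
```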


The overall segment matching algorithm
We present below our general algorithm to match the segments. It takes as input a set of segments {(segment s, view)} and gives as output a set of matches {m = ⟨s, s′⟩}. The general idea of this algorithm is that it considers a segment in view 0 and tries to track it in the following views; then it considers the next segment in view 0 and tries to track it, and so on for all the segments. Our algorithm does not allow a segment to be matched with several segments: it is a "1 to 1" matching. So, if a match m = ⟨s, s′⟩ exists between two segments and we find that the segment s′ matches better with a segment s_other, then the match m is deleted and the match ⟨s_other, s′⟩ is created.

For each segment s_j0 in view 0 do
    s_current = s_j0
    For each view i (starting from i = 1) do
        s_current^t = Get_transform(s_current)
        C_min = ∞
        For each fitted segment s_ji in view i do
            If Preliminary_weeding(s_current, s_ji) Then
                C = Criterion(s_current, s_ji)
                If C < C_min and C < ε Then
                    If Already_matched(s_ji) Then
                        s_j(i−1) = Get_match(s_ji)
                        If C < Criterion(s_j(i−1), s_ji) Then
                            C_min = C
                            s_min = s_ji
                        EndIf
                    Else
                        C_min = C
                        s_min = s_ji
                    EndIf
                EndIf
            EndIf
        EndFor
        If C_min ≠ ∞ Then
            Set_match(s_current, s_min)
            s_current = s_min
        Else
            Exit()   // End the current For loop.
        EndIf
    EndFor
EndFor

The Criterion function corresponds to the correlation criterion (or to the geometric distance in the former approach). The threshold ε is used to avoid matching segments that are too different: sometimes a segment in image I does not have a match in image I+1.


The Get_transform(s) function transforms the segment s in image I by the homography between image I and image I+1 and returns the transformed segment s^t.

Results
For the geometric distance criterion, we show only a case where the algorithm fails to match the segments correctly.

(Images 0 and 1: the red segments s_i and s_j′ have been matched, although we can obviously see that they do not correspond to the same edge.)

The error is due to the homography, which badly transforms the segments from image 0 to image 1. The image below shows the transformed segments.

(Image 1: the segments from image 0 after being transformed by H.)

The segment s_i^t is the segment closest to the segment s_j′, which is why they have been matched together. This highlights a limitation of the geometric distance criterion. With the correlation criterion this does not happen, provided the threshold is chosen properly. Here are the results for the correlation criterion on the same images:


Image 0

Image 1

The matches are all correct. The matching results are better with the correlation criterion than with the geometric distance criterion, but the former is time consuming. A solution could be to first use the geometric distance criterion as a weeding step and then run the correlation criterion on the remaining possible matches. At this point, it is important to keep in mind that we have matched segments in a sequence but the endpoints of those segments do not match.

Explanation of the choices
• We have said that the homography is not as accurate as the fundamental matrix to describe the relation between two images. But we use it to transform the endpoints of the segments because it is easy to use (x′ = H·x), while the fundamental matrix does not give points but lines (l′ = F·x).
• In the tracking process, we first detect the segments in each image independently. Another approach would be like Kanade-Lucas: detect the segments in the first image only and then try to track them in the next images. This method is really time consuming, because for each segment it would need to compute the best correlation by modifying the orientation and the position of the correlation window.


V. 3D Reconstruction

At this point, we have matched points and matched segments in the sequence. We will now use them to create a 3D reconstruction of the scene. We followed two different approaches to create this virtual model. The first one is a polygonal textured model, where each 2D point and segment is reconstructed in 3D and used as a vertex or boundary of the final mesh. The second is a dense matching technique, where we reconstruct the depth of each pixel in the image. They have in common their first three steps, which form the so-called stratified method: they consist in finding the camera matrices P_i, the 3D points X_i and the 3D segments. Below, we give a scheme of all the stages needed to obtain a polygonal textured model and a dense matching model.

Figure V-1: The general scheme for the reconstruction process (matched points and segments → projective reconstruction → quasi-affine reconstruction → affine reconstruction → Euclidean reconstruction → meshing, with a disparity map computed for dense matching).


A. Projective reconstruction

The set of segments obtained by the segment tracking algorithm cannot be used as it is. Indeed, the endpoints of a segment s = [x_f, x_s] in image I do not match the endpoints of its corresponding segment s′ = [x_f′, x_s′] in image I+1 (⟨x_f, x_f′⟩ and ⟨x_s, x_s′⟩ are not matches). To bypass this problem, we will not use the segments but the lines corresponding to those segments: for each segment, we compute the line passing through its endpoints, l = x_f × x_s (where × denotes the cross product). Thus, we had a set of segment matches ⟨s, s′, …, s^(n−1)⟩ and now we have a set of line matches ⟨l, l′, …, l^(n−1)⟩. As we now have a set of matched points and matched lines, we cannot use the fundamental matrix, which deals only with points. We use a different structure, called the trifocal tensor, which plays in three views a role analogous to that played by the fundamental matrix in two. We will first introduce the trifocal tensor and then show how we use it to compute the camera matrices, the 3D points and the 3D segments. We only deal with three views for the clarity of the explanation, but the approach can easily be extended to n views; the different ways to do so are explained in [2].
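As a quick sketch, the homogeneous line through the two segment endpoints:

```python
import numpy as np

def line_through(xf, xs):
    """Homogeneous line l = x_f x x_s passing through the two segment endpoints."""
    xf_h = np.array([xf[0], xf[1], 1.0])
    xs_h = np.array([xs[0], xs[1], 1.0])
    return np.cross(xf_h, xs_h)    # l satisfies l . x = 0 for both endpoints
```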

1. The trifocal tensor

a) Description
The trifocal tensor is a 3×3×3 tensor which contains the epipolar geometry of three views. Once computed, it allows one to extract the camera matrices of the three views in their canonical form (true up to a projective transformation). We have followed the explanations given in [3]. Suppose we have three images to which correspond three cameras P, P′ and P″. The cameras are defined (we use the canonical form):
P = [I | 0], P′ = [A | a_4], P″ = [B | b_4]
For each 3D point X_i, we have the relations: x_i = PX_i, x_i′ = P′X_i, x_i″ = P″X_i.
Imagine a line matched in three views.

Figure V-2: A 3D line L corresponds to a line matched in 3 views (the planes π, π′ and π″ back-projected from the image lines l, l′ and l″ through the camera centers C, C′ and C″ all contain L).


The equations of the three planes are:

$$\pi = P^T l = \begin{pmatrix} l \\ 0 \end{pmatrix}, \qquad
\pi' = P'^T l' = \begin{pmatrix} A^T l' \\ a_4^T l' \end{pmatrix}, \qquad
\pi'' = P''^T l'' = \begin{pmatrix} B^T l'' \\ b_4^T l'' \end{pmatrix}$$

A point X that lies on L satisfies those three equations: π^T X = π′^T X = π″^T X = 0.
We define a 4×3 matrix
$$M = [m_1, m_2, m_3] = [\pi, \pi', \pi''] =
\begin{pmatrix} l & A^T l' & B^T l'' \\ 0 & a_4^T l' & b_4^T l'' \end{pmatrix}$$
Thus M^T X = 0. As the rank of M is 2, there is a linear dependence between its columns m_i. We can write:
m_1 = α m_2 + β m_3
We compute α and β: α = k(b_4^T l″) and β = −k(a_4^T l′) for an unknown scalar k. Replacing those values:
l = (b_4^T l″) A^T l′ − (a_4^T l′) B^T l″
For each coordinate l_i, we have:
l_i = l″^T (b_4 a_i^T) l′ − l′^T (a_4 b_i^T) l″ = l′^T (a_i b_4^T) l″ − l′^T (a_4 b_i^T) l″
This relation can be written l_i = l′^T T_i l″ with T_i = a_i b_4^T − a_4 b_i^T (T_i a 3×3 matrix). We obtain:
l^T = l′^T [T_1, T_2, T_3] l″ = (l′^T T_1 l″, l′^T T_2 l″, l′^T T_3 l″)
The set of three 3×3 matrices {T_1, T_2, T_3} constitutes the trifocal tensor. We introduce a further notation: T_i^{jk} denotes the k-th column of the j-th row of the i-th matrix. From this relation, we may deduce other relationships that we will use for the computation of the trifocal tensor.


(1) Line correspondence:
We write the cross product:
$$\left( l'^T [T_1, T_2, T_3]\, l'' \right) [l]_\times = 0_3^T$$
that is,
$$\left( l'^T T_1 l'',\; l'^T T_2 l'',\; l'^T T_3 l'' \right)
\begin{pmatrix} 0 & -l_3 & l_2 \\ l_3 & 0 & -l_1 \\ -l_2 & l_1 & 0 \end{pmatrix} = 0_3^T$$
i.e. the three scalar equations
$$l'^T T_2 l''\, l_3 - l'^T T_3 l''\, l_2 = 0, \qquad
-\,l'^T T_1 l''\, l_3 + l'^T T_3 l''\, l_1 = 0, \qquad
l'^T T_1 l''\, l_2 - l'^T T_2 l''\, l_1 = 0$$
Writing out each entry in terms of the tensor components T_i^{jk} and collecting terms yields the compact tensor form:
$$\varepsilon_{ris}\, l_r\, l'_j\, l''_k\, T_i^{jk} = 0_s \qquad (V\text{-}1)$$
where the 3×3 matrix [l]_× denotes the cross-product matrix, and the tensor notation ε_{ijk} is defined for i, j, k = 1, …, 3 as follows:
$$\varepsilon_{ijk} = \begin{cases} 0 & \text{unless } i, j, k \text{ are distinct} \\ +1 & \text{if } ijk \text{ is an even permutation of } 123 \\ -1 & \text{if } ijk \text{ is an odd permutation of } 123 \end{cases}$$

(2) Point correspondence:
We can find a similar relation for the points:
$$[x']_\times \left( \sum_{i=1}^{3} x_i T_i \right) [x'']_\times = 0_{3\times 3}$$
We use this form:
$$x^i\, x'^j\, \varepsilon_{jpr}\, x''^k\, \varepsilon_{kqs}\, T_i^{pq} = 0_{rs} \qquad (V\text{-}2)$$
where the tensor notation ε_{ijk} is the same permutation tensor defined above (0 unless i, j, k are distinct, +1 for an even permutation of 123, −1 for an odd permutation).

Details of this result can be found in [3].

b) Computation of the trifocal tensor
We need three images in which we have matched points and segments. The algorithm to compute the trifocal tensor is described below:
• Create a 27×1 vector t containing the entries of the trifocal tensor T.
• From a set of point matches and line matches (at least 7 points, 13 lines, or a combination of points and lines), fill the n×27 matrix A with equations (V-1) and (V-2).
• Solve the system At = 0 by minimizing ||At|| with ||t|| = 1 (to avoid the solution t = 0). This is done by computing the SVD of A, A = UDV^T, and choosing t equal to the last column of V.

2. The camera matrices Here, we will just give the formulae to extract the camera matrices. As we use the canonical form, the first matrix is P = [I 0 ].

$$P' = \left[\, [T_1, T_2, T_3]\, e'' \;\middle|\; e' \,\right], \qquad
P'' = \left[\, (e'' e''^T - I)\, [T_1^T, T_2^T, T_3^T]\, e' \;\middle|\; e'' \,\right]$$
with e′ and e″ the epipoles.
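A direct sketch of these formulae in NumPy. Here T is assumed to be a 3×3×3 array with T[i] = T_i, and the epipoles e′, e″ are given as 3-vectors (how they are computed from the tensor is not shown):

```python
import numpy as np

def cameras_from_tensor(T, e1, e2):
    """Camera matrices P' and P'' from the trifocal tensor T and epipoles e', e''."""
    e1 = e1 / np.linalg.norm(e1)          # e'
    e2 = e2 / np.linalg.norm(e2)          # e''

    # P' = [ [T1, T2, T3] e'' | e' ]  (column i of the left block is T_i e'')
    left1 = np.stack([T[i] @ e2 for i in range(3)], axis=1)
    P1 = np.hstack([left1, e1.reshape(3, 1)])

    # P'' = [ (e'' e''^T - I) [T1^T, T2^T, T3^T] e' | e'' ]
    left2 = (np.outer(e2, e2) - np.eye(3)) @ np.stack([T[i].T @ e1 for i in range(3)], axis=1)
    P2 = np.hstack([left2, e2.reshape(3, 1)])
    return P1, P2
```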

3. The 3D points

With the three camera matrices and the n matched points, we can reconstruct the 3D points.

Figure V-3: A 3D point X_i corresponds to a point matched in 3 views (its images are x_i, x_i′ and x_i″ in the cameras of centers C, C′ and C″).


Indeed, we have the relations for each point (X_i; x_i, x_i′, x_i″): x_i = PX_i, x_i′ = P′X_i, x_i″ = P″X_i.
The first relation reads:
$$\begin{pmatrix} x_i \\ y_i \\ w_i \end{pmatrix} =
\begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix}
\begin{pmatrix} X_i \\ Y_i \\ Z_i \\ W_i \end{pmatrix}
\;\equiv\;
\begin{cases}
x_i = p_{11} X_i + p_{12} Y_i + p_{13} Z_i + p_{14} W_i \\
y_i = p_{21} X_i + p_{22} Y_i + p_{23} Z_i + p_{24} W_i \\
w_i = p_{31} X_i + p_{32} Y_i + p_{33} Z_i + p_{34} W_i
\end{cases}$$
$$\equiv\;
\begin{cases}
(p_{31} X_i + p_{32} Y_i + p_{33} Z_i + p_{34} W_i)\, x_i = (p_{11} X_i + p_{12} Y_i + p_{13} Z_i + p_{14} W_i)\, w_i \\
(p_{31} X_i + p_{32} Y_i + p_{33} Z_i + p_{34} W_i)\, y_i = (p_{21} X_i + p_{22} Y_i + p_{23} Z_i + p_{24} W_i)\, w_i
\end{cases}$$
As these are homogeneous coordinates, w_i can be chosen equal to 1:
$$\equiv\;
\begin{cases}
(p_{31} x_i - p_{11}) X_i + (p_{32} x_i - p_{12}) Y_i + (p_{33} x_i - p_{13}) Z_i + (p_{34} x_i - p_{14}) W_i = 0 \\
(p_{31} y_i - p_{21}) X_i + (p_{32} y_i - p_{22}) Y_i + (p_{33} y_i - p_{23}) Z_i + (p_{34} y_i - p_{24}) W_i = 0
\end{cases}$$
$$\equiv\; A_i X_i = 0 \quad\text{with}\quad
A_i = \begin{pmatrix}
p_{31} x_i - p_{11} & p_{32} x_i - p_{12} & p_{33} x_i - p_{13} & p_{34} x_i - p_{14} \\
p_{31} y_i - p_{21} & p_{32} y_i - p_{22} & p_{33} y_i - p_{23} & p_{34} y_i - p_{24}
\end{pmatrix}$$
We can obtain the same equations for P′ and P″: A_i′ X_i = 0, A_i″ X_i = 0. We create a matrix A_{X_i} for the point (X_i; x_i, x_i′, x_i″):
$$A_{X_i} = \begin{pmatrix} A_i \\ A_i' \\ A_i'' \end{pmatrix}$$
Thus, for each point (X_i; x_i, x_i′, x_i″), we must solve A_{X_i} X_i = 0. So it is again a minimization of ||A_{X_i} X_i|| with the constraint ||X_i|| = 1. As we have already done, we use the SVD of A_{X_i}: A_{X_i} = UDV^T; X_i is the last column of V. We have completed the 3D reconstruction of the points.
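A minimal triangulation sketch following this derivation, where cameras is a list of 3×4 matrices and points2d the corresponding (x, y) observations of one 3D point:

```python
import numpy as np

def triangulate(cameras, points2d):
    """Linear triangulation: solve A_X X = 0 by SVD and return X (homogeneous 4-vector)."""
    rows = []
    for P, (x, y) in zip(cameras, points2d):
        # Each view contributes the two rows of A_i derived above.
        rows.append(x * P[2] - P[0])   # (p31*x - p11, p32*x - p12, ...)
        rows.append(y * P[2] - P[1])
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                         # last column of V
    return X / X[3]                    # normalize so that W = 1
```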


4. The 3D segments

We need to reconstruct the 3D segments in order to achieve better results. We describe here the two approaches that we implemented and show their results. In the first approach, we expand the 2D segments into infinite lines, reconstruct a 3D line from these corresponding lines, and determine the 3D segments from it. The second uses the 2D segments only: it adjusts the 2D segments so that their endpoints correspond (i.e. match), adds those new endpoints to the 2D points, and runs the algorithm to reconstruct the 3D points described above.

• First method: 3D approach
We consider a line in three views (i.e. l, l′, l″). The algorithm presented here works for more than 3 views and has been implemented this way; the restriction to three views is made for the clarity and simplicity of the explanations, and because the principle is the same for three views or more. Thanks to the camera matrices, we can compute the planes π, π′ and π″ defined by each camera center and the corresponding 2D line in its image plane:
π = P^T l, π′ = P′^T l′, π″ = P″^T l″
As we can see in the figure below, the planes do not intersect in a single 3D line. In most cases, we will obtain 3 lines (the dashed lines in Figure V-4) because of the imprecision of the 2D matched lines.

Figure V-4: The planes back-projected from the matched lines intersect in 3 distinct lines.

We search for the 3D line L as close as possible to the dashed lines.


To determine this line, we form a 3×4 matrix A:
$$A = \begin{pmatrix} \pi_0 & \pi_1 & \pi_2 & \pi_3 \\ \pi_0' & \pi_1' & \pi_2' & \pi_3' \\ \pi_0'' & \pi_1'' & \pi_2'' & \pi_3'' \end{pmatrix}$$
The singular value decomposition of A is A = UDV^T. The best rank-2 approximation of A is given by the first two singular vectors. Two planes define a 3D line; those two planes are the first two columns of V (corresponding to the two largest singular values). The line L found this way does not correspond exactly to any of the lines in the images. We want the planes π, π′ and π″ to intersect in one 3D line, so we look for the Maximum Likelihood estimate of the line L in 3-space by minimizing a geometric distance between L projected into each image and the corresponding segment. The method used to compute the best 2D lines and the best unique 3D line relies on Levenberg-Marquardt minimization. Here is the way to proceed:
• Compute the three new planes π_C, π_C′ and π_C″ containing the camera centers C, C′ and C″ and the 3D line L.
• Compute the 2D lines l_C, l_C′ and l_C″ corresponding to the intersections between the image planes and the planes π_C, π_C′ and π_C″.
• Project the endpoints of the segments s = [x_f, x_s], s′ = [x_f′, x_s′] and s″ = [x_f″, x_s″] onto the lines l_C, l_C′ and l_C″ respectively.
• Compute the distances d²(x_f, l_C), d²(x_s, l_C), d²(x_f′, l_C′), d²(x_s′, l_C′), d²(x_f″, l_C″), d²(x_s″, l_C″).
• Modify slightly the 3D line L to minimize those distances.
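A sketch of the initial, linear part of this estimate, before the Levenberg-Marquardt refinement: stack the back-projected planes and take the two dominant right singular vectors as the pair of planes defining L (cameras are 3×4 matrices, lines2d the homogeneous image lines):

```python
import numpy as np

def initial_line_from_planes(cameras, lines2d):
    """Return two 4-vectors (planes) whose intersection is the initial 3D line L."""
    # Back-project each image line l to the plane pi = P^T l.
    planes = np.array([P.T @ l for P, l in zip(cameras, lines2d)])
    _, _, Vt = np.linalg.svd(planes)
    # The two dominant right singular vectors span the best rank-2
    # approximation of the stacked planes; they define the line L.
    return Vt[0], Vt[1]
```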

Now, the 3D line L projects to the 2D lines l_C, l_C′ and l_C″. From the 3D line, we may deduce a 3D segment. The endpoints of this segment correspond to the back-projection of the endpoints of the 2D segments. We determine those 2D endpoints by projecting the endpoints of the segments s = [x_f, x_s], s′ = [x_f′, x_s′] and s″ = [x_f″, x_s″] onto the lines l_C, l_C′ and l_C″. The projected segments are called s_p, s_p′ and s_p″. For each projected segment, we obtain two 3D points lying on the line L. Thus, we keep the two 3D points that maximize the length of the 3D segment.

Figure V-5: The green points are the endpoints of the projected segments s_p, s_p′ and s_p″ on the lines l_C, l_C′ and l_C″ in images 0, 1 and 2.


To back-project the endpoints of the projected segments, we use the pseudo-inverse of the camera matrices: P⁺ = P^T(PP^T)⁻¹. For example, the endpoint x_f of the segment s_p is back-projected to the point X_f = P⁺x_f. We compute the 6 rays passing through the back-projected points and the camera centers C, C′ and C″. Each ray intersects the line L in a 3D point; we keep the two points that define the longest 3D segment. To conclude, we had a set of matched 2D segments; after applying our algorithm, each 2D segment match has a corresponding 3D line and the endpoints of those 2D segments also match. The images below show the results of our algorithm: the 3D segments have been projected into the images and displayed.

As one can see, some segments are incorrect. The bad segments are all close to the epipolar lines: such a reconstruction is almost singular and gives large errors.

Figure V-6: Line reconstruction degeneracy (the lines l and l′ lie almost in an epipolar plane of the cameras C and C′, so the back-projected planes are nearly parallel and the 3D line L is poorly constrained).

In Figure V-6, the planes back-projected from the lines l and l ′ are almost contained in an epipolar plane. In this case, the line L is usually poorly determined; a small error in the image


gives a large error in the 3D reconstruction. That is why we get such bad lines in our reconstruction. The explanation is given for two views but it is the same for three views. The only way to solve this problem is to remove the segments that are closed to an epipolar line. This problem is a huge limitation. For the construction of the mesh, those segments would be as important as the others to improve the accuracy of the mesh. For the others segments, we get a good accuracy and we have the property that they satisfy the relation (V-1) and their end-points the relation (V-2). • Second method: 2D approach This time we deal only with 2D information. We start from the point where we have tracked segments in a sequence. The endpoints of those segments do not correspond. To make them match, we use the homographies between the images. The method is the same as the one described in IV-B-3 where we explained how to refine two segments so that their length is the same using the homography. Consider a segment match over a sequence, denoted s, s ′,..., s (i ) ,..., s ( n−1) . The length of the segments is different in each image. Using an iterative method, we get the segment endpoints to match pair by pair (image I with image I+1), we do that for all the images until the endpoints correspond in all the images.We proceed this way: i=0 nb _ iterations = 0 While (i < ( n − 1) ) or (nb _ iterations < nb _ iter _ max ) do Refine( s ( i) , s ( i+1) , s (ir) , s (ri+1) ) // s (ri) = x rf(i ) , xrs( i) , s (ri+1) = x (rfi+1) , xrs( i+1)

(

)

(

)

[

]

[

]

If d x (fi ) , xrf( i) < ε and d x s(i ) , xsf( i) < ε Then s ( i) = s r(i ) s ( i+1) = s r( i+1) i = i +1 Else s ( i) = s r(i ) s ( i+1) = s r( i+1) i = i −1 EndIf nb _ iterations = nb _ iterations + 1 EndWhile The threshold ε is chosen equal to 0.1 (we tolerate a maximal distance of a 10-th of pixel) and nb _ iter _ max to 10 * n . The notation d ( a, b) denotes the distance between 2D points a and b . The method Refine( s ( i) , s ( i+1) , s (ir) , s (ri+1) ) computes the longest segments the way we described it page 22. We apply this algorithm to all the matches.
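Purely as an illustration, the driver loop above could look as follows in Python. The helper refine_pair stands in for the Refine method of IV-B-3 (returning the two refined segments computed with the homography) and is assumed rather than implemented here, and the backward step is clamped at 0 so the index stays valid.

```python
import numpy as np

def match_endpoints(segments, refine_pair, eps=0.1, iter_factor=10):
    """Iteratively propagate endpoint matching along a tracked segment chain.

    segments:    list of n segments, each a (2, 2) array [x_f, x_s] of 2D endpoints.
    refine_pair: assumed helper, refine_pair(s_i, s_j) -> (s_ri, s_rj),
                 the homography-based refinement of section IV-B-3.
    """
    n = len(segments)
    i, nb_iterations = 0, 0
    while i < n - 1 and nb_iterations < iter_factor * n:
        s_ri, s_rj = refine_pair(segments[i], segments[i + 1])
        # Distance moved by the endpoints of segment i during this refinement.
        moved = max(np.linalg.norm(segments[i][0] - s_ri[0]),
                    np.linalg.norm(segments[i][1] - s_ri[1]))
        segments[i], segments[i + 1] = s_ri, s_rj
        if moved < eps:
            i += 1               # endpoints of image i are stable: move forward
        else:
            i = max(i - 1, 0)    # they moved: go back and re-propagate
        nb_iterations += 1
    return segments
```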


Suppose we know the homography between two images: when we transform the points of the first image by this homography, we get an approximate position of the points in the second image. It means that the endpoints of the segments will not match exactly, but we have a good approximation. We add the endpoints of the segments to the 2D points obtained with Kanade-Lucas + Weeding and we run the algorithm that computes the 3D points, described in V-A-3. The images below show the results; this time, we do not have bad segments.

Image 0    Image 1    Image 2

The endpoints of the segments do not always match perfectly, as shown in the pictures below, extracted from images 0 and 1.

To solve this lack of precision, for each matched segment we compute the correlation between their endpoints so that they match perfectly. When refining the segments (i.e. matching their endpoints), we also tried to use the fundamental matrix instead of the homography. But for segments that are almost parallel to the epipolar lines, the refinement does not work properly, and in general it does not significantly improve the results. Below, we describe a case of failure.


To refine two matched segments s = [x_f, x_s] and s' = [x'_f, x'_s] with the fundamental matrix, we first compute the epipolar lines l_e and l'_e corresponding to x_f and x'_f (and then to x_s and x'_s):

l_e^T = x'_f^T F,    l'_e = F x_f. We see that a small error on x_f (image 0) changes the epipolar line l'_e (image 1) by an angle δ', but the matching point x'_r stays almost the same (x'_r0 ≈ x'_r1). However, a small error on x'_f (image 1) changes the epipolar line l_e, and the matching point can be modified considerably (x_r0 ≠ x_r1).

Image 0    Image 1

For other lines, not almost parallel to the epipolar lines, it works properly. We thus have the same problem as in the first method (the 3D approach): segments almost parallel to the epipolar lines cannot be correctly determined. We therefore discarded this fundamental-matrix approach because it was too sensitive to noise and not general enough.

5. The reconstruction ambiguity

Now we know the 3D points, the 3D segments and the camera matrices. Below, the images show the reconstructed scene from the cameras' points of view. For the 3D segments we used the second method (the 2D approach).


View 0    View 1    View 2

In view 0 we can identify the scene, but in the other views it is difficult to recognize it.

One problem that immediately arises from these images is that not all the points are in front of the cameras. Figure V-7 explains this situation: points can lie on either side of the image plane and still project into the image at the right position.

Figure V-7: The left figure shows a camera with 3 points. The right figure shows the projection of those 3 points in the image plane.

Consider a camera matrix P determined by the method we described above. This camera maps the 3D points X to 2D points x according to x = PX. Suppose an unknown invertible 4×4 matrix H (i.e. H^{-1}H = I_4). We can then replace the points X by HX and the camera P by PH^{-1} and the 2D points will not be modified:

    x = PX = (PH^{-1})(HX)


It means that we did not determine the camera matrices uniquely: they are known only up to an unknown invertible 4×4 matrix H, called a projective transformation or projectivity. In other words, we know the camera matrices up to a projective transformation. When a 3D scene is transformed by a projective matrix, lengths, areas, angles and parallelism are not preserved; the scene after transformation is completely distorted.
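The ambiguity is easy to verify numerically. The following minimal numpy sketch (illustrative only, with randomly generated P, X and H) checks that x = PX and x = (PH^{-1})(HX) give the same image point up to scale:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((3, 4))                  # an arbitrary camera matrix
X = np.append(rng.standard_normal(3), 1.0)       # a homogeneous 3D point
H = rng.standard_normal((4, 4))                  # a random 4x4 projectivity
assert abs(np.linalg.det(H)) > 1e-6              # make sure H is invertible

x1 = P @ X                                       # x = P X
x2 = (P @ np.linalg.inv(H)) @ (H @ X)            # x = (P H^-1)(H X)

# The two homogeneous image points are equal up to scale, so the image
# measurements cannot distinguish (P, X) from (P H^-1, H X).
print(np.allclose(x1 / x1[2], x2 / x2[2]))       # True
```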

Figure V-8: The left image is a view of the original scene. The right image shows the house after a projective transformation H.

The reconstruction made with the trifocal tensor is projective. We must find the transformation H^{-1} to obtain a correct reconstruction that respects the Euclidean invariants. We proceed in two steps: we first upgrade to an affine reconstruction and then to a Euclidean reconstruction. From now on, when we talk about 3D points, we include both the 3D points computed from the Kanade-Lucas output and those computed from the segment endpoints.

B. Affine reconstruction

The first part of the affine reconstruction consists in modifying the cameras and the 3D points so that all the 3D points are in front of the cameras. This first step is called quasi-affine reconstruction. The goal of the second part is to locate the plane at infinity and find a transformation that sends it back to its original position.

1. Quasi-affine reconstruction

In a Euclidean world, the plane at infinity has the equation π_∞^T = (0 0 0 1). It contains all the 3D points whose coordinates are X^T = (X Y Z 0). In the original scene, the plane at infinity is well located (i.e. π_∞^T = (0 0 0 1)). But when we record a scene with an imaging device and then build a projective reconstruction, the plane at infinity is moved and is no longer at the right position. Indeed, when the scene is transformed by the projective transformation H^{-1}, the plane at infinity becomes:

    π_t = H^T π_∞ = [h11 h21 h31 h41; h12 h22 h32 h42; h13 h23 h33 h43; h14 h24 h34 h44] (0 0 0 1)^T = (h41 h42 h43 h44)^T

If we look again at the projective reconstruction we obtained, from a different point of view, we get:

Normally, the plane at infinity encloses all the points, segments and cameras of the 3D world. But because we only have a projective reconstruction, there are points on both sides of the plane at infinity. In the image above, the red line separates the points lying on one side of the plane at infinity from the points lying on the other side. Inside the red circle, the point where the lines intersect belongs to the plane at infinity: we know that lines which are parallel in the Euclidean world intersect on the plane at infinity (here, the arrows indicate lines that are parallel in reality). In the projective world, those lines intersect in a finite 3D point, which means that this point is the image of a point at infinity lying on the plane at infinity. Thus, we first transform all 3D points X_i and cameras P_j so that all the points lie in front of all cameras. To do so, we must find a transformation H that ensures that depth(HX_i, P_j H^{-1}) > 0. We use the following result (see [3] for the proof):


    depth(HX_i, P_j H^{-1}) ≐ (π_t^T X_i)(π_t^T C_j) δ    if H^{-T} π_t = (0, 0, 0, 1)^T,

where C_j is the centre of camera j, δ is a scalar whose possible values are {-1, 1}, and a ≐ b means that a and b have the same sign. We find an unknown plane π_t that satisfies the following inequalities, called the cheiral inequalities, with a simplex method described in [4]:

    X_i^T π_t > 0 for all i
    δ C_j^T π_t > 0 for all j

δ corresponds to the sign of the determinant of H. As it is not known a priori, solutions must be sought for both values. We then form the matrix

    H = [ I_3  0_3 ; π_t^T ] = [1 0 0 0; 0 1 0 0; 0 0 1 0; π_tx π_ty π_tz π_tw]

and transform X̂_i = H X_i and P̂_j = P_j H^{-1}. The solution to the inequalities is not unique, which means that the plane π_t we determined is not, in fact, the plane at infinity. But all the 3D points are now in front of the cameras, so the reconstruction is called quasi-affine. The images below show our final quasi-affine reconstruction.
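As a side note, the cheiral inequalities can be solved with any linear-programming routine. The sketch below is not the implementation used here (which relies on the simplex method of [4]); it uses scipy.optimize.linprog and maximises a margin d, so that the strict inequalities hold whenever the optimal d is positive.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cheiral_inequalities(X, C, delta):
    """Look for a plane pi_t with X_i^T pi_t > 0 and delta * C_j^T pi_t > 0.

    X:     (N, 4) homogeneous 3D points of the projective reconstruction.
    C:     (M, 4) homogeneous camera centres.
    delta: +1 or -1 (sign of det H); the caller tries both values.
    Returns pi_t as a 4-vector, or None if no solution exists for this delta.
    """
    A = np.vstack([X, delta * C])                # every row a_k must satisfy a_k . pi_t > 0
    n_rows = A.shape[0]
    # Variables: [pi_t (4 values), d (margin)]. Maximise d subject to
    # A pi_t >= d, with |pi_t_k| <= 1 so that the margin stays bounded.
    c = np.array([0.0, 0.0, 0.0, 0.0, -1.0])     # linprog minimises, so minimise -d
    A_ub = np.hstack([-A, np.ones((n_rows, 1))]) # -A pi_t + d <= 0
    b_ub = np.zeros(n_rows)
    bounds = [(-1.0, 1.0)] * 4 + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if res.success and res.x[-1] > 0:
        return res.x[:4]
    return None
```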

Image 0    Image 1    Image 2

These images are obtained by projecting the 3D points and 3D segments into the images using the camera matrices.


2. Affine reconstruction

To upgrade to an affine reconstruction, we must find the plane at infinity π_t. Indeed, parallel lines meet on the plane at infinity: once we know the coordinates of this plane, we can compute the transformation that sends it to infinity and thereby obtain an affine reconstruction, because lines that are parallel in the real world are also parallel in an affine reconstruction. We first find the boundaries of the subspace in which the plane lies. We know that it encloses all the points and cameras, so its last coordinate can be chosen equal to 1: if we want the inequality π_t^T C_0 ≠ 0, i.e. (π_tx π_ty π_tz π_tw)(0 0 0 1)^T ≠ 0, to hold, we can take π_tw = 1. We then solve the cheiral inequalities so as to maximize π_ti and -π_ti for i ∈ {x, y, z}; this gives the boundaries of the image of the plane at infinity. In this subspace, we perform an exhaustive search for the plane π_t that minimizes the image distance between the 3D points HX_i projected with the cameras P_j H^{-1} and the measured 2D points in the images, with

    H = [ I_3  0_3 ; π_tx π_ty π_tz 1 ]

We now have an affine reconstruction.
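For completeness, a small sketch of this upgrade step, assuming the plane π_t has already been found by the search above (the normalisation and the matrix layout follow the text; the exhaustive search itself is not shown):

```python
import numpy as np

def affine_upgrade(pi_t, points, cameras):
    """Apply the upgrade H built from the plane pi_t found by the search above.

    pi_t:    4-vector, the located image of the plane at infinity.
    points:  (N, 4) array of homogeneous 3D points X_i.
    cameras: list of 3x4 camera matrices P_j.
    Returns (H X_i, P_j H^-1) for all points and cameras.
    """
    pi_t = np.asarray(pi_t, dtype=float)
    pi_t = pi_t / pi_t[3]          # normalise so that pi_tw = 1
    H = np.eye(4)
    H[3, :] = pi_t                 # H = [I_3 0_3; pi_t^T]
    H_inv = np.linalg.inv(H)
    upgraded_points = (H @ points.T).T
    upgraded_cameras = [P @ H_inv for P in cameras]
    return upgraded_points, upgraded_cameras
```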

C. Euclidean Reconstruction

We will not explain this part in detail because we did not implement it; this time we are concerned with the absolute conic Ω_∞.

In the Euclidean world, a point X = (X Y Z W)^T on the absolute conic satisfies the equations

    X² + Y² + Z² = 0  and  W = 0

(thus the point X belongs to the plane at infinity). In the affine reconstruction, the absolute conic still lies on the plane at infinity but has a different equation. Ω_∞ is imaged by each affine camera P_j as the image conic ω_t^j. Suppose we have found ω_t^0 (in the first camera). We form the transformation

    H = [ A^{-1}  0 ; 0^T  1 ]    with  A A^T = (ω_t^0)^{-1}

and we apply X̂_i = H X_i and P̂_j = P_j H^{-1} to upgrade the affine reconstruction to a Euclidean reconstruction.

D. Meshing

This part has not been implemented in VXL, so we just give an overview of the method. Meshing consists in creating triangles between the 3D points; the triangulation method chosen is generally Delaunay.


As we have 3D segments, we want to constrain the Delaunay triangulation; the algorithm is well described in [3]. Then a texture is applied to the mesh; the texture is computed from one or several images of the original sequence.

E. Dense matching

Dense matching is another way to reconstruct a scene from an image sequence. Its goal is to find, for each pixel in the first image, its corresponding pixel in the second image. With this information, we create a depth map and then a 3D reconstruction of the scene. The best aspect of this technique is that we compute the 3D location of every pixel in the image, which gives a reconstruction that is much smoother and closer to reality. Our method uses two images only.

Figure V-9: Dense matching.

To do so, we need to compute the fundamental matrix between the two images. The fundamental matrix is computed from the points tracked in the images with Kanade-Lucas + Weeding. We then have the relation between the images x'^T F x = 0. From this relation, all we may deduce is that the pixel x' corresponding to a pixel x in image 0 lies on the line l' = Fx in image 1. To find the match of the pixel x, we have to search along the epipolar line l' for the pixel x' whose value (i.e. intensity) is the closest to that of the pixel x. In fact, as the relation is symmetric, an epipolar line l^T = x'^T F in the first image corresponds to an epipolar line l' = Fx in the second image.

Figure V-10: The corresponding epipolar lines.


We will transform the images so that the rows of the resulting images are the epipolar lines. This transformation of the images is called rectification. Thus, if row i corresponds to the epipolar line l in image 0, then row i corresponds to the epipolar line l' in image 1, and the match of a pixel x = (i, j) in rectified image 0 belongs to row i in rectified image 1.

1. Image rectification

Suppose we know the fundamental matrix F between image 0 and image 1. We may then obtain the epipoles with the formulae Fe = 0 and F^T e' = 0. To make the epipolar lines become the rows of the rectified images, we just have to send the epipoles to infinity. The process is as follows (a short sketch of this construction is given below):

• We apply the translation

    H_t = [1 0 -c_i; 0 1 -c_j; 0 0 1]

  to the first image so that the centre of the image (c_i, c_j, 1) is at the origin.

• Then we apply the rotation

    H_r = [ e_j/r  -e_i/r  0 ; e_i/r  e_j/r  0 ; 0 0 1 ],  with r = sqrt(e_i² + e_j²),

  to the first image to put the epipole e at (0, r, 1).

• Now we send the epipole to infinity with

    H_∞ = [1 0 0; 0 1 0; 0 -1/r 1]

  in image 0. The applied transformation is thus H = H_∞ H_r H_t.

• For the second image, we do not apply the same algorithm. We transform image 1 by the homography H' that minimizes the sum of squared distances Σ_i d(Hx_i, H'x'_i)², so that the rectified images match.

To compute the two rectified images, we use H^{-1} and H'^{-1} to ensure that every pixel in rectified image 0 and in rectified image 1 has a value.
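A minimal numpy sketch of the construction of H for image 0 (illustrative only): it follows the (i, j, 1) point convention of the text, takes the image centre as (rows/2, cols/2), and does not cover the estimation of H' for the second image.

```python
import numpy as np

def rectifying_homography(F, image_shape):
    """Build H = H_inf @ H_r @ H_t for image 0, following the three steps above.

    Points are written (i, j, 1) with i the row and j the column index.
    F is the fundamental matrix with F e = 0; image_shape = (rows, cols).
    """
    # Epipole of image 0: right null vector of F.
    _, _, Vt = np.linalg.svd(F)
    e = Vt[-1]
    e = e / e[2]

    # 1) Translate the image centre (c_i, c_j) to the origin.
    c_i, c_j = image_shape[0] / 2.0, image_shape[1] / 2.0
    H_t = np.array([[1.0, 0.0, -c_i],
                    [0.0, 1.0, -c_j],
                    [0.0, 0.0, 1.0]])

    # Coordinates of the translated epipole.
    e_i, e_j, _ = H_t @ e
    r = np.hypot(e_i, e_j)

    # 2) Rotate the epipole onto (0, r, 1).
    H_r = np.array([[ e_j / r, -e_i / r, 0.0],
                    [ e_i / r,  e_j / r, 0.0],
                    [ 0.0,      0.0,     1.0]])

    # 3) Send (0, r, 1) to infinity, so the epipolar lines become the image rows.
    H_inf = np.array([[1.0, 0.0,      0.0],
                      [0.0, 1.0,      0.0],
                      [0.0, -1.0 / r, 1.0]])

    return H_inf @ H_r @ H_t
```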

The images below show the results of the rectification:


Image 0    Image 1

A row i in image 0 and a row i in image 1 correspond to the same line in the scene. A particular point (i, j) in image 0 is at the position (i, j + d) in image 1; the scalar d is called the disparity.

2. Disparity map

For each pixel in rectified image 0, we can compute a disparity value equal to the displacement of the pixel between image 0 and image 1. First, we need to find the matching pixel in rectified image 1 for each pixel in rectified image 0. To do so, we assume two constraints:

• The recorded scene is continuous.
• Consider two pixels x = (i, j_x) and y = (i, j_y) in rectified image 0: if j_x < j_y, then j'_x < j'_y in rectified image 1 (the ordering constraint).

Therefore, we use a dynamic programming algorithm developed by Peter Tu to match the pixels.

Figure V-11: Matching of the pixels in two corresponding rows (row i in image 0 and row i in image 1).


For each pair of corresponding rows in the rectified images, we run a dynamic programming matching. Afterwards, we know a disparity value for each pixel, and we store all those disparity values in an image called a disparity map.
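For readers unfamiliar with this kind of matcher, here is a generic scanline dynamic-programming sketch, not Peter Tu's algorithm: the absolute-difference cost and the occlusion penalty are illustrative choices, and only the ordering constraint along one pair of corresponding rows is enforced.

```python
import numpy as np

def match_scanline(row0, row1, occlusion_cost=20.0):
    """Match two rectified scanlines with dynamic programming.

    Enforces the ordering constraint: if pixel j in row0 matches j' in row1
    and k > j, then k can only match k' > j'. Returns an array disp with
    disp[j] = j' - j for matched pixels and NaN for occluded ones.
    """
    n, m = len(row0), len(row1)
    cost = np.full((n + 1, m + 1), np.inf)
    move = np.zeros((n + 1, m + 1), dtype=np.uint8)   # 0: match, 1: skip row0, 2: skip row1
    cost[0, :] = occlusion_cost * np.arange(m + 1)
    cost[:, 0] = occlusion_cost * np.arange(n + 1)
    move[0, 1:] = 2
    move[1:, 0] = 1
    for j in range(1, n + 1):
        for k in range(1, m + 1):
            c_match = cost[j - 1, k - 1] + abs(float(row0[j - 1]) - float(row1[k - 1]))
            c_skip0 = cost[j - 1, k] + occlusion_cost
            c_skip1 = cost[j, k - 1] + occlusion_cost
            best = min(c_match, c_skip0, c_skip1)
            cost[j, k] = best
            move[j, k] = 0 if best == c_match else (1 if best == c_skip0 else 2)
    # Back-track to recover the matched pairs and their disparities.
    disp = np.full(n, np.nan)
    j, k = n, m
    while j > 0 and k > 0:
        if move[j, k] == 0:
            disp[j - 1] = (k - 1) - (j - 1)
            j, k = j - 1, k - 1
        elif move[j, k] == 1:
            j -= 1
        else:
            k -= 1
    return disp
```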

Figure V-12: A disparity map.

3. Segment constraints

As we have developed a segment tracking algorithm, we use the detected and matched segments to constrain the dynamic programming matching. First, we run the segment tracking algorithm between image 0 and image 1. Then we delete the segments that are close to the epipolar lines, because they would not constrain the dense matching properly. We use the fundamental matrix computed on the Kanade-Lucas points to adjust the endpoints of the segments and make them match; as we have deleted the segments close to the epipolar lines, we do not run into the problem described at the end of V-A-4. We transform the endpoints of the segments by H in the first image and by H' in the second image, so that we now have rectified segments in both rectified images. Finally, we compute the disparity values from the first endpoint to the second endpoint for all segments.

Figure V-13: d_0, …, d_i, …, d_n-1 are the disparities for one segment (rectified image 0 and rectified image 1).

With those disparities, we create a segment disparity map. All the values of this map are equal to 0, except those that belong to a segment in rectified image 0; those values are equal to the computed disparity (the scalars d_i in Figure V-13).
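One possible way to fill this map is sketched below, under the assumption that a matched segment is given by its endpoints in both rectified images (in row, column order) and that the disparity varies linearly between the endpoint disparities, as Figure V-13 suggests.

```python
import numpy as np

def add_segment_disparities(seg_disp_map, s0, s1):
    """Write the disparities of one matched segment into the segment disparity map.

    seg_disp_map: 2D array (same size as rectified image 0), initially all zeros.
    s0: ((i_f, j_f), (i_s, j_s)) endpoints of the segment in rectified image 0.
    s1: ((i_f, j_f), (i_s, j_s)) matching endpoints in rectified image 1.
    The disparity is interpolated linearly between d_f = j1_f - j0_f at the
    first endpoint and d_s = j1_s - j0_s at the second one.
    """
    (i0f, j0f), (i0s, j0s) = s0
    (_, j1f), (_, j1s) = s1
    d_f, d_s = j1f - j0f, j1s - j0s
    n_samples = int(max(abs(i0s - i0f), abs(j0s - j0f))) + 1
    for t in np.linspace(0.0, 1.0, n_samples):
        i = int(round(i0f + t * (i0s - i0f)))
        j = int(round(j0f + t * (j0s - j0f)))
        if 0 <= i < seg_disp_map.shape[0] and 0 <= j < seg_disp_map.shape[1]:
            seg_disp_map[i, j] = (1.0 - t) * d_f + t * d_s
    return seg_disp_map
```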


Below is a segment disparity map.

Figure V-14: A segment disparity map. The red ovals show the modifications due to the use of the segments in the dense matching algorithm.

The dynamic programming matching is constrained by the segment disparity map because we then know the disparity of certain pixels a priori. For example, in textureless areas such as walls, the original dense matching algorithm (without segment constraints) has trouble matching the pixels correctly. But if some segments cross these areas, such as the edges between walls, the constrained algorithm knows the disparity of some pixels there and can match them accurately. For the remaining textureless areas between the edges, the properties of the dynamic programming create a gradient of disparity, which appears as flat areas in the reconstruction. Figure V-15 is a difference map computed by subtracting the segment disparity map from the disparity map; the disparity in some regions has changed and is now more accurate than it was before.

Figure V-15: A difference map


VI. Conclusion and Future Work

We have created an original segment tracking algorithm that gives strong results, a process to reconstruct tracked points and tracked segments, and a constrained dense matching method. It is integrated in VXL in the GEL library and used by the Image Understanding group. The methods and algorithms developed could be improved in several ways:

• The segment tracking algorithm developed is a greedy algorithm. A more global approach could be envisaged to tackle the complex combinatorial problem raised by segment matching.

• The 3D reconstruction computed is only projective (quasi-affine, in fact). The next step would be to upgrade it to an affine reconstruction and then to a Euclidean one. To do so, the segments could be used to locate the plane at infinity precisely.

• We constrained the dense matching algorithm with tracked segments. We could also use other kinds of features, such as curves tracked with an appropriate method.


Bibliography

[1] Canny, J., A Computational Approach to Edge Detection, IEEE PAMI 8(6), 679-698, November 1986.
[2] Robert Kaucic, Nicolas Dano, Richard Hartley, Plane-based Projective Reconstruction, ICCV 2001.
[3] Richard Hartley, Andrew Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, September 2000.
[4] William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery, Numerical Recipes in C, Cambridge University Press, 1992.
