Volumetric Appraisal System for Obstacle Detection from Optical Flow Generated by a Calibrated Camera Undergoing Known Motion

IRI-DT 04-02

Óscar Pereles Ligero
Institut de Robòtica i Informàtica Industrial
Universitat Politècnica de Catalunya - CSIC
Llorens i Artigas 4-6, Edifici U, 2a pl., Barcelona 08028, Spain
[email protected]

September 2004


Abstract

This report describes a simple obstacle detection system based on a computer vision application. Using a single camera in motion, we acquire enough information about the scene to decide whether an obstacle really exists.

Keywords: computer vision, image processing, optical flow, motion, 3D vision.


Index of Contents

1. Introduction
2. Optical Flow
   2.1. The Pinhole Model
   2.2. Representation of a point ∈ R3 in the image
   2.3. Calibration Process
3. 3D Reconstruction from Optical Flow Generated by a Calibrated Camera Undergoing Known Motion
   3.1. Optical Flow Analysis
   3.2. Calculation of 3D Coordinates
4. Camera Calibration: Faugeras' Method
5. Feature Detection and Feature Tracking
   5.1. State of the Art
   5.2. KLT Tracker
   5.3. KLT Implementation
6. Experimental Results


1 Introduction

Imagine we want to fit an obstacle detection system to a vehicle. Its objective is to warn the driver if there is an obstacle while the vehicle is moving backwards. The driver's visual field is obviously very restricted in that situation, and automatic obstacle detection could be helpful. The success of such a system therefore depends on how well the decision criterion resolves, from the scene information provided by the sensors, whether an obstacle exists. The information we are interested in, and the precision we need, depend on each particular application and determine which sensors should be used.

Scene characterization becomes the main task in a scene based application like this one, and it is at the same time the main problem we face in the design of our obstacle detection system. By scene characterization we mean extracting the relevant information about the scene that helps us determine whether an obstacle exists. A priori, we can consider information such as colour, texture, shape or similar scene features as irrelevant for scene characterization. On the other hand, we can consider object dimensions and locations as relevant information, and that is why we need distance measurement sensors. Defining and finding this relevant information from the sensor data is exactly the main problem we referred to. We choose a very simple solution. Our interest in simplicity is due, on the one hand, to processing time constraints and, on the other, to economizing on devices, which is justified by the low precision required. We are not interested in the colour of a pedestrian's eyes, but in resolving the pedestrian as an obstacle. In the following pages we develop this idea carefully.

When it comes to sensing devices, there exist many methods for distance measurement: ultrasonic devices, lasers, inductive and capacitive sensors, microwave proximity sensors, and so on. The choice among them depends on the application, on the precision needed and on the available funds; it is not the aim of this report to review them. Lasers or cameras are perhaps the most suitable for us, and we choose cameras because of their good trade-off between performance and cost.

The problem which arises when cameras are chosen is, given a set of images of a scene, to extract relevant information in order to decide, based on a decision criterion, whether an obstacle exists. Object dimensions or locations are needed in order to apply the decision criterion, and that is why we use 3D reconstruction methods. Notice that 2D methods do not let us determine object dimensions or locations: depth is not available. On the contrary, 3D methods even allow the reconstruction of scene objects, so we can locate the objects in the scene and determine their dimensions. Stereo vision and optical flow analysis belong to this group. In stereo vision we use the differences between the images captured by two cameras to calculate the position of an object. Besides stereo vision, another 3D method is motion analysis, and in particular the analysis of the optical flow. If we attach a single camera to a vehicle in order to obtain a series of images while the vehicle is moving, we can deduce very useful information from these data by analyzing the effect of the motion, which causes differences between subsequent images. In general, we can detect and differentiate moving objects and calculate their motion properties. We do this by calculating the so-called optical flow, which is a vector field indicating how much and in which direction each pixel is moving. From these data, it is possible to retrieve a depth map of the perceived environment and to estimate the dimensions of the structures present.

Finally, in our design we choose the optical flow analysis method. The main reason is the reduction in device cost: only one single camera is needed. Optical flow analysis is not as accurate as stereo vision and its processing time is longer, but given that we do not need high precision and that the information processing is very simple, it is a suitable choice. All in all, 3D reconstruction through optical flow analysis offers a good balance between reliability, cost and processing time.

2 Optical Flow

To relate a point x ∈ R3 to its pixel coordinates in an image, it is necessary first to analyze the geometric model that defines how a point is represented in a CCD camera.

2.1 The Pinhole Model

The pinhole camera model is the most widely used in computer vision systems, since it models a common camera correctly with a very simple mathematical formulation. Let O represent the camera focal point. The projection of a point x ∈ R3 from O onto the image plane I determines the representation q of this point in the image. Projective geometry reduces this problem to a simple linear transformation of points. See Figure 1.

Let x ∈ R3 be an arbitrary point whose projection from O onto I is q. Then q̃ = [U, V, s]ᵀ represents the homogeneous coordinates of any point placed on the ray that intersects the image plane at q, which can be written q = [U/s, V/s]ᵀ for s = z, where x̃ = [x, y, z, 1]ᵀ is the expression of x in homogeneous coordinates. Thus, applying similarity of triangles, the projection q̃ can be written as the linear transformation:

$$
\tilde{q} = \begin{bmatrix} U \\ V \\ s \end{bmatrix}
= \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{x},
\qquad \text{where } u = \frac{U}{s} = f\,\frac{x}{z}, \quad v = \frac{V}{s} = f\,\frac{y}{z}
\tag{1}
$$

Figure 1: Pinhole camera model
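
As a quick illustration of (1), the following is a minimal NumPy sketch of the pinhole projection. The system described in this report is implemented in C; this fragment is only illustrative, and the focal length used in the example is an arbitrary placeholder.

```python
import numpy as np

def pinhole_project(x_h, f):
    """Project a homogeneous 3D point x_h = [x, y, z, 1] with focal length f, as in (1)."""
    M = np.array([[f, 0.0, 0.0, 0.0],
                  [0.0, f, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])   # U = f*x, V = f*y, s = z
    U, V, s = M @ x_h
    return np.array([U / s, V / s])        # q = [f*x/z, f*y/z]

# Example: a point 2 m in front of the camera, slightly off-axis (f in metres, assumed)
print(pinhole_project(np.array([0.1, 0.2, 2.0, 1.0]), f=0.008))
```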

Figure 2: Coordinate systems

2.2 Representation of a point ∈ R3 in the image

Let x̃ represent a point x ∈ R3 in the camera coordinate system, and let x̃_world be the representation of that same point x in a coordinate system named world. If the transformation needed to convert from one coordinate system to the other is a rigid transformation, that is, one that preserves angles and lengths, then x̃ is the result of rotating x̃_world by R and translating it by t. See Figure 2. The transformation is thus defined by a rotation R and a translation t which, applied to the coordinates of a point in the world coordinate system, yield the expression of the point in the camera coordinate system:

$$
\tilde{x} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \tilde{x}_{world}
\tag{2}
$$

Finally, we must scale by the CCD width and height in order to express q in pixel units. If we also translate the image plane origin to (c0, d0) to center the image, we obtain the point m̃, which represents the pixel associated with q. We name these matrices K and H respectively:

$$
\tilde{m} = K H \tilde{x}_{world}
= \begin{bmatrix} \alpha_u & s & c_0 & 0 \\ 0 & \alpha_v & d_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \tilde{x}_{world}
\tag{3}
$$

where the skew parameter s encodes the non-orthogonality of the u and v directions. We will always assume rectangular pixels (zero skew). In compact form we define:

$$
\tilde{m} = K H \tilde{x}_{world} = P \tilde{x}_{world}
\tag{4}
$$

where P is the perspective projection matrix.
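
A compact sketch of how (3) and (4) compose into P, again as an illustrative NumPy fragment rather than the report's C implementation; the intrinsic values in the example are placeholders, not the calibrated ones.

```python
import numpy as np

def projection_matrix(alpha_u, alpha_v, c0, d0, R, t, skew=0.0):
    """Build P = K H as in (3)-(4); rectangular pixels (zero skew) assumed by default."""
    K = np.array([[alpha_u, skew, c0, 0.0],
                  [0.0, alpha_v, d0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])   # 3x4 intrinsic matrix
    H = np.eye(4)
    H[:3, :3] = R                           # rotation from world to camera
    H[:3, 3] = t                            # translation from world to camera
    return K @ H                            # 3x4 perspective projection matrix P

def project(P, x_world):
    """Map a 3D world point to its pixel m = (u, v) through (4)."""
    U, V, s = P @ np.append(x_world, 1.0)
    return np.array([U / s, V / s])

# Toy example with placeholder intrinsics and an identity camera pose
P = projection_matrix(800.0, 800.0, 320.0, 240.0, np.eye(3), np.zeros(3))
print(project(P, np.array([0.1, 0.2, 2.0])))
```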

2.3 Calibration Process

To deduce geometric information from an image, we must determine the parameters that relate the position of a point in the scene to its position in the image. This is known as camera calibration. Calibration has to be carried out before any computer vision application is run, especially when an accurate 3D scene reconstruction is needed. Currently this is a cumbersome process of estimating the intrinsic and extrinsic parameters of a camera, although recent advances in computer vision suggest that we might eventually be able to eliminate this process altogether. The problem consists in obtaining these intrinsic and extrinsic parameters given one image or a set of images. By intrinsic parameters we mean those associated with the physical properties of the camera, such as the focal length or the lens distortion; they are the parameters needed to link the pixel coordinates of an image point with the corresponding coordinates in the camera reference frame. Extrinsic parameters, on the other hand, are related to the scene that surrounds the camera, namely its orientation and location; they define the location and orientation of the camera reference frame with respect to a known world reference frame. To solve the correspondence problem given by (3), the matrices K and H have to be calculated, where K and H represent the intrinsic and extrinsic parameters respectively. Many methods have been proposed to solve this problem, and in recent years a number of them have appeared that try to use little or no information about the scene; they are usually called self-calibration methods. Perhaps the main drawback of those methods is that they can provide reconstruction only up to an unknown scale factor.


All the same, in our computer application for obstacle detection we use a multiplanar calibration method, in particular Faugeras' method. This kind of method is more robust than the previous ones, essentially because the calibration uses point information from multiple planes.

3 3D Reconstruction from Optical Flow Generated by a Calibrated Camera Undergoing Known Motion

Once the perspective projection matrix P has been calculated, it immediately relates the position of a point in the scene to its position in the image plane simply by applying the transformation given by (4). That is, given the 3D world coordinates of a point, it is possible to find its corresponding 2D image coordinates. Nevertheless, given the 2D coordinates of a point in the image, it is not possible to calculate the 3D representation of this point in the world, since a single image does not provide enough information to infer the depth of points; we can only build the ray that represents the projection of the point from O onto I. Image coordinates lack the one additional degree of freedom needed to represent 3D points, and to overcome this lack of information a second image has to be used. In stereo vision this second image is provided by a second camera. In optical flow based systems, another image of the sequence is used instead.

3.1 Optical Flow Analysis

The problem to be solved with a single camera is very similar to the stereo vision problem. In stereo vision we infer information about the 3D structure and distance of a scene from two or more images. Once the images have been rectified we have coplanar images, and the problem reduces to deciding which parts of the left and right images are projections of the same scene element. In optical flow analysis from known motion, the available images have been taken from different viewpoints and at different time instants. That is why we need to know the camera motion in order to solve the correspondence problem. Reconstruction by this method requires triangulation to be carried out through a series of images. If we take two images at instants t0 and t1 and we know the camera motion between those instants (for example, for a camera fixed to a vehicle, through encoders or gyroscope sensors), then 3D reconstruction is possible. 3D reconstruction through optical flow analysis consists in assigning to some points of the t0 image their corresponding points in the t1 image. It is possible when, given an image sequence, we know the camera is moving and we are able to find the corresponding points between consecutive images.


By correspondence we mean determining which elements of a frame correspond to which elements in the next frame of the sequence. This is different from stereo correspondence because image sequences are sampled temporally at very high rates, so the spatial disparities between consecutive frames are, on average, much smaller in motion than in typical stereo pairs. Correspondence can easily be enhanced by exploiting the temporal aspect of motion sequences and by employing tracking techniques. The correspondence problem can also be cast as the problem of estimating the apparent motion of the image brightness pattern, usually called the optical flow. There are two strategies for solving the correspondence problem:

- Differential methods lead to dense measures, that is, computed at each image pixel. They use estimates of time derivatives, and therefore require the images to be sampled closely.
- Matching methods lead to sparse measures, that is, computed only at a subset of image points. This is what we use in our computer vision application.

We will focus on the correspondence problem later. Now we will try to calculate the 3D coordinates, given the image coordinates at t0 and t1.

3.2 Calculation of 3D Coordinates

We approach the objective from the opposite viewpoint. The pairs of coordinates (u0, v0) and (u1, v1) are the representations of a world point x̃ in the images at t0 and t1, respectively. If we know the vehicle motion, and therefore the motion of the camera fixed to it, we can relate the coordinate systems of the images at t0 and t1 through a simple translation. See Figure 3. If the vehicle motion is reduced to a displacement along the Y axis of the world coordinate system, we can write the translation referred to as t = (0, d, 0). Furthermore, if we are able to relate any image coordinate system to the world coordinate system, we can express the 2D coordinates as a transformation of the 3D coordinates. Calibration provides us with the matrix P relating the (Xw, Yw, Zw) coordinate system and the camera position at instant t0. Computing the camera elevation or the inclination angle α is not necessary, since all these parameters have already been captured by the calibration. On the other side, we call P' the perspective projection matrix relating the (Xw, Yw, Zw) coordinate system and the camera position at t1. P' is simply the result of translating the t0 coordinate system by (0, d, 0). Once the problem has been analyzed from this opposite viewpoint, it is natural to think that, given (u0, v0) and (u1, v1) as the image coordinates of a 3D point, we can recover the 3D coordinates of that point. However, we must note that the rays corresponding to the projections of those points do not necessarily intersect in a single point x̃. Since the straight lines generally do not intersect due to inaccuracy, it is necessary to calculate this point through an adjustment method.


Figure 3: Motion analysis

Perhaps the idea becomes clearer if we write the equations that describe the projections. We now take (u0, v0) and (u1, v1) as the image coordinates of a 3D point.

$$
\begin{bmatrix} U_0 \\ V_0 \\ s_0 \end{bmatrix} = P \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
\tag{5}
$$

$$
\begin{bmatrix} U_1 \\ V_1 \\ s_1 \end{bmatrix} = \underbrace{P\,T}_{P'} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},
\qquad \text{where } T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & d \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\text{ represents the translation by } (0, d, 0).
\tag{6}
$$

By scaling the homogeneous coordinates we get u0, v0, u1 and v1, where:

$$
u_0 = \frac{U_0}{s_0}, \quad v_0 = \frac{V_0}{s_0}, \quad u_1 = \frac{U_1}{s_1}, \quad v_1 = \frac{V_1}{s_1}.
\tag{7}
$$

Let P = (p_{ij}) and P' = (p'_{ij}). From (5), (6) and (7) it follows that:

$$
u_0 = \frac{p_{11} x + p_{12} y + p_{13} z + p_{14}}{p_{31} x + p_{32} y + p_{33} z + p_{34}}
\tag{8}
$$

and similarly for u_1, v_0 and v_1. We can write the system of equations as:

$$
\underbrace{\begin{bmatrix}
p_{31} u_0 - p_{11} & p_{32} u_0 - p_{12} & p_{33} u_0 - p_{13} \\
p_{31} v_0 - p_{21} & p_{32} v_0 - p_{22} & p_{33} v_0 - p_{23} \\
p'_{31} u_1 - p'_{11} & p'_{32} u_1 - p'_{12} & p'_{33} u_1 - p'_{13} \\
p'_{31} v_1 - p'_{21} & p'_{32} v_1 - p'_{22} & p'_{33} v_1 - p'_{23}
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} x \\ y \\ z \end{bmatrix}}_{X}
=
\underbrace{\begin{bmatrix}
p_{14} - p_{34} u_0 \\
p_{24} - p_{34} v_0 \\
p'_{14} - p'_{34} u_1 \\
p'_{24} - p'_{34} v_1
\end{bmatrix}}_{b}
\tag{9}
$$

This overdetermined system admits the least squares solution:

$$
X = (A^{\top} A)^{-1} A^{\top} b
\tag{10}
$$

Thus, given the correspondence defined by the points (u0, v0) and (u1, v1), we finally obtain the 3D coordinates of the corresponding scene point. Let X represent these coordinates.
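
The construction of (9) and the least squares solution (10) can be sketched as follows (an illustrative NumPy fragment, not the report's C code; np.linalg.lstsq is used instead of forming the normal equations explicitly, which is equivalent for this purpose).

```python
import numpy as np

def triangulate(P, P1, u0, v0, u1, v1):
    """Recover X = (x, y, z) from one correspondence, following (8)-(10).

    P  : 3x4 projection matrix at t0
    P1 : 3x4 projection matrix at t1 (P1 = P @ T for the known translation T of eq. 6)
    """
    A = np.array([
        [P[2, 0]*u0 - P[0, 0],  P[2, 1]*u0 - P[0, 1],  P[2, 2]*u0 - P[0, 2]],
        [P[2, 0]*v0 - P[1, 0],  P[2, 1]*v0 - P[1, 1],  P[2, 2]*v0 - P[1, 2]],
        [P1[2, 0]*u1 - P1[0, 0], P1[2, 1]*u1 - P1[0, 1], P1[2, 2]*u1 - P1[0, 2]],
        [P1[2, 0]*v1 - P1[1, 0], P1[2, 1]*v1 - P1[1, 1], P1[2, 2]*v1 - P1[1, 2]],
    ])
    b = np.array([
        P[0, 3] - P[2, 3]*u0,
        P[1, 3] - P[2, 3]*v0,
        P1[0, 3] - P1[2, 3]*u1,
        P1[1, 3] - P1[2, 3]*v1,
    ])
    X, *_ = np.linalg.lstsq(A, b, rcond=None)   # least squares solution of A X = b
    return X
```

Given P from calibration and a measured displacement d, one would build T as in (6), set P1 = P @ T and call triangulate for each tracked correspondence.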

4 Camera Calibration: Faugeras' Method

The calibration method we use in our computer vision application derives from Faugeras' theory of calibration. It is based on a multiplanar calibration pattern, which makes it a robust method. The algorithm which implements the method has been developed by J. Andrade [1] and by A. Kosaka and K. Rahardja [10]. It is worth noting that only two coordinate systems are used in this calibration method: one related to the world and the other to the image plane. Neither lens distortion nor the possible error in converting CCD coordinates (in metric units) to screen coordinates (in pixel units) is considered. All the same, the results are satisfactory, and the errors in depth estimation are lower than 5%. To calculate the perspective projection matrix P we need to define the relations between the 3D coordinates of a world point and the 2D coordinates of that point in the image. If we can write this relation in linear form, we can use the least squares method to find an expression for P. The calibration process is based on a set of calibration points Mi whose world coordinates are known, represented as (xwi, ywi, zwi) in an absolute coordinate system and as (ui, vi) in the image coordinate system. Let Q be a candidate perspective projection matrix:




$$
Q = \begin{bmatrix}
q_{11} & q_{12} & q_{13} & q_{14} \\
q_{21} & q_{22} & q_{23} & q_{24} \\
q_{31} & q_{32} & q_{33} & q_{34}
\end{bmatrix}
= \begin{bmatrix}
q_1^{\top} & q_{14} \\
q_2^{\top} & q_{24} \\
q_3^{\top} & q_{34}
\end{bmatrix}
\tag{11}
$$

For each point Mi we can write:

$$
\begin{bmatrix} U_i \\ V_i \\ s_i \end{bmatrix}
= \begin{bmatrix} q_1^{\top} & q_{14} \\ q_2^{\top} & q_{24} \\ q_3^{\top} & q_{34} \end{bmatrix}
\begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix}
\;\Rightarrow\;
\begin{cases}
U_i = q_1^{\top} \cdot (x_i, y_i, z_i) + q_{14} \\
V_i = q_2^{\top} \cdot (x_i, y_i, z_i) + q_{24} \\
s_i = q_3^{\top} \cdot (x_i, y_i, z_i) + q_{34}
\end{cases}
$$

thus

$$
u_i = \frac{U_i}{s_i} = \frac{q_1^{\top} \cdot (x_i, y_i, z_i) + q_{14}}{q_3^{\top} \cdot (x_i, y_i, z_i) + q_{34}}
\tag{12}
$$

$$
v_i = \frac{V_i}{s_i} = \frac{q_2^{\top} \cdot (x_i, y_i, z_i) + q_{24}}{q_3^{\top} \cdot (x_i, y_i, z_i) + q_{34}}
\tag{13}
$$

Expressed in linear form:

$$
x_i q_{11} + y_i q_{12} + z_i q_{13} + q_{14} - u_i x_i q_{31} - u_i y_i q_{32} - u_i z_i q_{33} - u_i q_{34} = 0
$$
$$
x_i q_{21} + y_i q_{22} + z_i q_{23} + q_{24} - v_i x_i q_{31} - v_i y_i q_{32} - v_i z_i q_{33} - v_i q_{34} = 0
$$

For N calibration points, we get a 2N-dimensional system of homogeneous equations:

$$
A \cdot Q = 0
\tag{14}
$$

where Q = [q11, q12, q13, ..., q34]ᵀ is the vector of unknowns and A (of size 2N × 12) is a matrix composed of the 2D and 3D coordinates of the calibration points. If the calibration points are not coplanar, it can be proved that the rank of A is 11, that is, not all of its columns are independent. As a result, the trivial solution Q = [0, 0, ..., 0]ᵀ is not the only possibility. Note that more than six points are needed in the calibration process, and to find the solution we must also consider some restrictions related to the nature of a perspective projection matrix. For Q to represent a perspective projection matrix, two conditions must hold:

$$
\| q_3 \| = 1
\tag{15}
$$

$$
(q_1 \times q_3) \cdot (q_2 \times q_3) = 0
\tag{16}
$$

Therefore, the problem consists in solving the system of equations A · Q = 0 subject to the restrictions (15) and (16).

If we begin by applying the restriction given by (15), we can separate the vector Q into coordinates subject to the restriction and coordinates not subject to it:

$$
Q = \begin{bmatrix}
q_{11} \\ q_{12} \\ q_{13} \\ q_{14} \\ q_{21} \\ q_{22} \\ q_{23} \\ q_{24} \\ q_{31} \\ q_{32} \\ q_{33} \\ q_{34}
\end{bmatrix}
= S_1 \cdot
\underbrace{\begin{bmatrix}
q_{11} \\ q_{12} \\ q_{13} \\ q_{14} \\ q_{21} \\ q_{22} \\ q_{23} \\ q_{24} \\ q_{34}
\end{bmatrix}}_{q_{rest}}
+ S_2 \cdot
\underbrace{\begin{bmatrix}
q_{31} \\ q_{32} \\ q_{33}
\end{bmatrix}}_{q_3}
\tag{17}
$$

where S1 ∈ M_{12×9} and S2 ∈ M_{12×3}. Solving A · Q = 0 exactly makes no sense, since there are many sources of error (precision of the points, calculation errors, etc.). That is why the most reasonable option is to minimize ‖A · Q‖² subject to the restriction ‖q3‖ = 1. If Q = S1 · q_rest + S2 · q3, then we can write:

$$
A \cdot Q = A \cdot S_1 \cdot q_{rest} + A \cdot S_2 \cdot q_3 = C_{2N \times 9} \cdot q_{rest} + D_{2N \times 3} \cdot q_3
\tag{18}
$$

The task at hand is to solve the following constrained minimization problem:

$$
\min \| C \cdot q_{rest} + D \cdot q_3 \|^2 \quad \text{subject to} \quad \| q_3 \| = 1
\tag{19}
$$

Using the method of Lagrange multipliers, the problem is equivalent to minimizing:

$$
R = \| C \cdot q_{rest} + D \cdot q_3 \|^2 + \lambda \cdot (1 - \| q_3 \|^2)
\tag{20}
$$

And the solution yields:

$$
q_{rest} = -(C^{\top} C)^{-1} C^{\top} D \cdot q_3
\tag{21}
$$

$$
\lambda q_3 = B q_3 = D^{\top} \left( I - C (C^{\top} C)^{-1} C^{\top} \right) D \cdot q_3
\tag{22}
$$

Expression (22) means that q3 is an eigenvector of the matrix B with eigenvalue λ. It can be proved that, in order to minimize (19), q3 has to be chosen as the eigenvector of B associated with the smallest eigenvalue. Once q3 is calculated, it is possible to obtain q_rest from (21).
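
A possible NumPy sketch of this constrained solution, equations (17)-(22), is given below. It assumes the unknowns are ordered as Q = [q11, ..., q14, q21, ..., q24, q31, ..., q34] and that A (of size 2N x 12) has already been built from the calibration points; it is only an illustration, not the actual implementation by J. Andrade [1] used in this work.

```python
import numpy as np

def solve_constrained_calibration(A):
    """Solve A Q = 0 subject to ||q3|| = 1, following (17)-(22)."""
    rest_idx = [0, 1, 2, 3, 4, 5, 6, 7, 11]      # components belonging to q_rest
    q3_idx = [8, 9, 10]                          # components belonging to q3
    S1 = np.zeros((12, 9)); S1[rest_idx, np.arange(9)] = 1.0
    S2 = np.zeros((12, 3)); S2[q3_idx, np.arange(3)] = 1.0

    C = A @ S1                                   # 2N x 9
    D = A @ S2                                   # 2N x 3
    CtC_inv = np.linalg.inv(C.T @ C)
    # B = D^T (I - C (C^T C)^-1 C^T) D, as in (22)
    B = D.T @ (D - C @ (CtC_inv @ (C.T @ D)))

    eigvals, eigvecs = np.linalg.eigh(B)         # B is symmetric; eigenvalues in ascending order
    q3 = eigvecs[:, 0]                           # eigenvector of the smallest eigenvalue
    q_rest = -CtC_inv @ C.T @ D @ q3             # eq. (21)

    Q = S1 @ q_rest + S2 @ q3                    # reassemble Q as in (17)
    return Q.reshape(3, 4)                       # candidate perspective projection matrix
```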


The remaining restriction is the one given by (16). However, if the coordinate system is orthogonal the condition is satisfied immediately and it ceases to be a restriction. Once the perspective projection matrix P is obtained, we can deduce the extrinsic and intrinsic parameters; nevertheless, this is not necessary in our application. In this application we use the implementation of the algorithm developed by J. Andrade [1]. The algorithm is based on Faugeras' method and makes use of a calibration pattern with 225 points distributed on a square, planar surface. To get measurements in different planes, the pattern has to be placed in 3 different positions along the Zw axis of the world coordinate system, so that the total number of calibration points amounts to 675. See Figure 4.

Figure 4: Calibration method

For each image placed at depth Zwi, the image plane coordinates of the calibration points are calculated. The steps followed by the method are:

1. Image binarization to extract the black points of the pattern.
2. Calculation of the center of gravity of each point (the (ui, vi) point position).
3. Determination of the CCD position of the center point of the pattern, which is assigned the position (0, 0, Zwi) in the world coordinate system.
4. Assignment, starting from the center point, of the positions of the remaining points, since the relative placement between pattern points is known.

To check the calibration efficiency we can reproject each calibration point, whose position in the world is known, onto the image. That is, from the point coordinates (xwi, ywi, zwi) in the absolute coordinate system, we calculate the coordinates (ui_reproj, vi_reproj) on the screen and compare them with the coordinates (ui_real, vi_real) they should have. A measure of the calibration quality can be obtained from the 2D reprojection error, defined as:

$$
\text{2D reprojection error} = \sqrt{(u_i^{real} - u_i^{reproj})^2 + (v_i^{real} - v_i^{reproj})^2} \quad [\text{pixels}]
$$

The following histogram represents the 2D error evaluated for the 675 points of each experiment:

Figure 5: 2D error histogram

The 2D reprojection error of this calibration method has a mean of 0.93 pixels and a standard deviation of 0.47 pixels for similar depth levels. Finally, if we extend this formulation, a 3D reprojection error can be defined:

$$
\text{3D reprojection error} = \sqrt{(x_{wi}^{real} - x_{wi}^{reproj})^2 + (y_{wi}^{real} - y_{wi}^{reproj})^2 + (z_{wi}^{real} - z_{wi}^{reproj})^2} \quad [\text{mm}]
$$

This 3D error is really representative in stereo vision, where the two cameras move together. Using it in optical flow analysis does not make much sense as such, because the reprojection error depends on the error of the devices that measure the motion. Nevertheless, if we take two calibration images and reconstruct the calibration points in 3D coordinates as described in the last section, then we are able to calculate a 3D reprojection error. Since this error depends on the error made when measuring the motion, we minimize the 3D reprojection error in this case. That is, if the sensor provides a displacement measurement of 10 cm, we do not take that value as exact but minimize the 3D reprojection error over a range of values around 10 cm. The minimum 3D error could then be found at 9.9 cm, for example, and the measurement error is thereby eliminated. The results are shown in Figure 6. If we eliminate the points most remote from the camera, the mean 3D reprojection error is 41.44 mm and the standard deviation 22.68 mm.
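
The displacement search described above can be sketched as follows, reusing the triangulate fragment from Section 3.2; the function and parameter names are our own, and the scan range is a placeholder.

```python
import numpy as np

def best_displacement(P, matches, world_points, d_measured, rel_range=0.05, steps=101):
    """Scan displacements around the sensor reading and keep the one that minimizes
    the mean 3D reprojection error over the calibration points.

    matches      : list of (u0, v0, u1, v1) correspondences between the two images
    world_points : list of the known 3D calibration points, in the same order
    """
    candidates = np.linspace(d_measured * (1 - rel_range),
                             d_measured * (1 + rel_range), steps)
    best_d, best_err = d_measured, np.inf
    for d in candidates:
        T = np.eye(4); T[1, 3] = d                        # translation (0, d, 0), as in (6)
        P1 = P @ T
        errs = [np.linalg.norm(triangulate(P, P1, *m) - Xw)
                for m, Xw in zip(matches, world_points)]  # 3D reprojection errors
        mean_err = float(np.mean(errs))
        if mean_err < best_err:
            best_d, best_err = d, mean_err
    return best_d, best_err
```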


Figure 6: Real Pattern vs. Reprojected Pattern Views [millimeters]. (Four 3D views of the calibration pattern, at viewing angles Az=-50/El=22, Az=0/El=0, Az=-90/El=0 and Az=-90/El=90.)

5 Feature Detection and Feature Tracking

In the first section we expressed our interest in simplicity, and we pointed out processing time as the main reason for it. Now we will justify that idea. If we are able to find in an image n representative points, or n 'good' feature points of the scene, and we are able to establish correspondences between frames i and i+1, then we will be able to calculate the 3D coordinates of each good point. Moreover, if we know the coordinates of these good points, we will be able to decide, based on a criterion, whether there is an obstacle in the scene. In this sense, complexity is avoided by reducing the 3D reconstruction of points to the 3D reconstruction of n good points. Consequently, processing time will be reduced by using n points to infer dimensions instead of reconstructing m pixel points in each pair of images, where m is the size of the image. In other words, we use the 3D reconstruction of n interesting points of each image to decide whether an object should be considered an obstacle.

There are two main parts to this task. The first is feature detection and the second is feature tracking. Feature detection consists in finding locations in a single image that are considered 'good' features; examples of good features include edges, corners or colour differences. Feature tracking makes use of the coherence between frames to track known features in one frame to their unknown corresponding positions in the following frame. We now give a brief review of feature detection methods, and later we introduce the KLT tracker we use in our application to obtain these n good points.

5.1 State of the Art

In scene analysis, the goal of low level vision is feature extraction. In particular, vertex, corner and edge features are very interesting when making correspondences between points in different frames. One of the most popular techniques is edge detection. There exists a myriad of methods for edge detection in images, since it is a primary element of computer vision technology. An extensive literature has been developed, from the Sobel, Prewitt or Roberts masks based on the first-order derivatives of images, or the LoG detector that uses second-order derivatives, to a family of methods for edge extraction based on the detection of extremes in the output of the convolution of the image with an impulse response, namely the Canny, Deriche and Spacek detectors.

Another popular salient feature extracted from images is the corner. By corners we mean those points in an image where there is a steep change in intensity in more than one direction. Such points usually correspond to physical vertices on windows, walls or furniture in the scene, to their shadows, or to any other point with high curvature over any type of surface. The detection of corners or vertices is very important because these features are often used to identify objects in the scene, for stereoscopic matching or for displacement vector measuring, so accurate localization of these features is of great interest. Years ago, to obtain these features, curvature analysis along edge chains had to be used, searching for points with high curvature along edges. These techniques, which relied on the results of preceding edge extraction modules, are no longer popular due to their high computational cost. More recent methods for corner detection are based on the direct computation of the gradients and on the curvature of an image. One of the first functions used to characterize the corner response that did not require a previous edge extraction step is due to Beaudet, who proposed a rotationally invariant operator for corner extraction. This operator was derived from the second order Taylor series of the image intensities I(x, y), and consisted in the computation of the determinant of the Hessian for each image point. Variations on this approach include those of Kitchen and Rosenfeld, Dreschler and Nagel, and Harris and Stephens. See Figure 7.

All of these operators require the computation of second order derivatives on the image, which makes them not only very noise sensitive but also inaccurate for precise localization. Nevertheless, these functions are usually used to obtain initial corner localization estimates, and to refine these estimates several heuristics can be used, such as multiresolution, non-maximum suppression and energy minimization. In most cases, the vertex features found with the algorithms described in the last paragraphs will correspond to geometrical entities from the environment, such as corners on windows, walls or furniture, or their reflections. Unfortunately, the tracking of such image features from one frame to the next might still be hard to attain. Affine deformations caused by the change in viewpoint, or by the variation of the reflectance conditions, contribute to this difficulty. With that in mind, Shi and Tomasi formulated an image feature selection algorithm optimal by construction from the equations of affine motion [13].

Figure 7: Beaudet, Kitchen and Rosenfeld, Shi, and Harris and Stephens features, ordered clockwise from the left. Images courtesy of J. Andrade.

5.2 KLT Tracker

KLT is an implementation, in the C programming language, of a feature tracker that will hopefully be of interest to the computer vision community. The tracker is based on the early work of Lucas and Kanade and was developed fully by Tomasi and Kanade, but the only published, readily accessible description is contained in the paper by Shi and Tomasi. Recently, Tomasi proposed a slight modification which makes the computation symmetric with respect to the two images; the resulting equation is fully derived in the unpublished note by Birchfield [4]. Briefly, good features are located by examining the minimum eigenvalue of each 2 by 2 gradient matrix, and features are tracked using a Newton-Raphson method that minimizes the difference between the two windows. Multiresolution tracking allows for even large displacements between images. The affine computation that evaluates the consistency of features between non-consecutive frames was implemented by Thorsten Thormaehlen several years after the original code and documentation were written.

The KLT algorithm proceeds as follows:

1. Blur the image with a Gaussian filter. The size of this filter can be altered in order to detect features at different scales.
2. Compute the gradients. The gradient is computed for each pixel along both the x and y axes.
3. Compute the 2x2 gradient matrix for each pixel, by looking at an adjustable window around the pixel. This matrix of summed gradient products is a measure of the curvature of the intensity surface at the pixel; a larger window means that the curvature is measured over a larger basis. The matrix is computed by adding the squared derivatives (dx*dx, dy*dy) as well as (dx*dy) over each pixel in the window.
4. Find the eigenvalues of each 2x2 gradient matrix. As mentioned earlier, the 2x2 gradient matrix is a measure of the curvature. The eigenvectors of this matrix represent the directions of minimum and maximum curvature, and the eigenvalues represent the amount of curvature in those directions.
5. Keep the pixels with the largest minimum eigenvalue. The minimum eigenvalue is the amount of curvature in the direction of minimum curvature. This is the value used to decide which features are 'good': good features are those which have the largest values of the minimum eigenvalue. If we wanted the 100 'best' features, we could keep the 100 pixels with the largest values of this eigenvalue.

The algorithm outlined above returns the 'best' features in a single image. These are the candidate features which we would like to use when tracking correspondences across sequences of images. Regarding feature tracking, a key observation concerning the input is very important for solving the correspondences. The input is a video sequence, which means that two sequential frames in the sequence are nearly identical. The difference between sequential frames is due to the camera's movement, and this movement is very small because the time step between frames is 1/30th of a second. Because sequential frames are so similar, corresponding features between frames will have very similar pixel coordinates. The algorithm presented in the feature tracking implementation section makes use of this coherence between frames to track known features in one frame to their unknown corresponding positions in the following frame.
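
The selection rule in steps 2-5 can be sketched as follows (an illustrative NumPy fragment, not the KLT library itself; it omits the Gaussian blur of step 1 and the minimum-distance enforcement between selected features).

```python
import numpy as np

def min_eigenvalue_map(image, half_win=3):
    """Per-pixel minimum eigenvalue of the 2x2 gradient matrix (steps 2-4).

    image: 2D float array, assumed already smoothed (step 1).
    """
    gy, gx = np.gradient(image)          # step 2: gradients along rows (y) and columns (x)
    Ixx, Iyy, Ixy = gx * gx, gy * gy, gx * gy
    lam = np.zeros_like(image)
    h, w = image.shape
    for r in range(half_win, h - half_win):
        for c in range(half_win, w - half_win):
            sl = (slice(r - half_win, r + half_win + 1),
                  slice(c - half_win, c + half_win + 1))
            gxx, gyy, gxy = Ixx[sl].sum(), Iyy[sl].sum(), Ixy[sl].sum()   # step 3: window sums
            # step 4: smaller eigenvalue of [[gxx, gxy], [gxy, gyy]], in closed form
            lam[r, c] = 0.5 * (gxx + gyy - np.sqrt((gxx - gyy) ** 2 + 4.0 * gxy ** 2))
    return lam

def best_features(image, n=100, half_win=3):
    """Step 5: keep the n pixels with the largest minimum eigenvalue."""
    lam = min_eigenvalue_map(image, half_win)
    flat = np.argsort(lam, axis=None)[::-1][:n]
    return np.column_stack(np.unravel_index(flat, lam.shape))   # (row, col) pairs
```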


5.3 KLT Implementation

The feature detection and tracking algorithm is implemented in C. The input to the program is a sequence of PPM format images, and the output is a text file containing the pixel coordinates of each tracked feature in every frame of the sequence. The program works on pairs of images from the sequence. For example, the program tracks features from frame 1 to frame 2, then from frame 2 to frame 3, and so on. This means that once we have a starting point, or base case, we can then simply work on tracking features from frame i to frame i+1. The program starts by finding the best features in frame 1 using the KLT algorithm described in the last section. The features detected in this step are then shown to the user by highlighting the pixels with a coloured box. These are the potential features to track. The user must then click on the features that should be tracked throughout the sequence. These are called active features. Once the active features are known in frame 1 (the base case), the program starts processing pairs of images.

This part of the program has been modified in our particular version of the 'KLT obstacle detector' in order to calculate the 3D reconstruction of the points and to apply a decision criterion. The output of the original program has been removed and is used instead to calculate, given the perspective projection matrix P and a measurement of the displacement between frames, the 3D coordinates of the good feature points. For each good feature point, the 3D coordinates are calculated and the decision criterion is applied. The output is now an alert when an obstacle is detected. We will discuss the decision criterion in the experimental results section.

The rest of the program works on frames i and i+1. It assumes that the pixel coordinates of the active features in frame i are known. Next, the program performs feature detection on frame i+1 to determine the locations of the potential features. As noted in the last section, corresponding features in sequential frames have very similar pixel coordinates. This fact is exploited by reducing the search for a correspondence to each active feature: for each active feature, a correspondence is sought by starting the search in frame i+1 at the active feature's pixel coordinate from frame i. The search is constrained to a window around this pixel, whose size is set relative to the expected speed at which the features move in pixel coordinates. If just one potential feature lies inside this window, it is assumed to be the correspondence. If multiple potential features lie inside this window, the program performs an image difference between neighbourhoods to find the best matching feature, and this best match is marked as the correspondence. Once the correspondences have been found for all active features, the algorithm repeats on frames i+1 and i+2.

The errors in tracking can come from a number of different sources. The two most common are blurry images, which make image differencing less accurate, and camera motion that is too fast for the window size. The problem of fast camera motion means that either zero potential features are in the window or only incorrect potential features are in the window. In our obstacle detection system we will not have the latter problem, since the time step between frames will be longer than the 1/30th of a second of a video sequence. Our frame rate will be lower in order to let the program process and calculate all the 3D coordinates for the set of points. The number of points to be tracked and the time step between frames are configurable, and that is why processing time will not be a problem. The amount of information to be processed will be chosen depending on the precision needed and on the speed of the available devices.
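
The window-constrained search and the neighbourhood image difference described above could look roughly like the sketch below. The function names, window size and matching cost are our own assumptions, not the actual C program's.

```python
import numpy as np

def match_feature(prev_pos, candidates, img_prev, img_next, win=15, patch=7):
    """Find, among the candidate features detected in frame i+1, the best match for an
    active feature located at prev_pos = (row, col) in frame i."""
    r0, c0 = prev_pos
    in_window = [(r, c) for (r, c) in candidates
                 if abs(r - r0) <= win and abs(c - c0) <= win]
    if not in_window:
        return None                     # feature lost (e.g. motion too fast for the window)
    if len(in_window) == 1:
        return in_window[0]             # a single candidate is assumed to be the correspondence

    # Several candidates: compare neighbourhoods by sum of absolute differences.
    # Assumes the feature lies at least `patch` pixels away from the image border.
    ref = img_prev[r0 - patch:r0 + patch + 1, c0 - patch:c0 + patch + 1].astype(float)
    best, best_cost = None, np.inf
    for (r, c) in in_window:
        cand = img_next[r - patch:r + patch + 1, c - patch:c + patch + 1].astype(float)
        if cand.shape != ref.shape:     # candidate too close to the border
            continue
        cost = np.abs(cand - ref).sum()
        if cost < best_cost:
            best, best_cost = (r, c), cost
    return best
```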

6 Experimental Results

In the last sections we have described the mathematical tools necessary to develop an obstacle detection system based on a computer vision application. We began by presenting the most common camera model, the pinhole model, which defines the representation of a point on the camera image plane as a simple linear transformation of the 3D world coordinates of the point. The camera calibration process consists precisely in calculating this transformation, and a method for it was shown, in particular Faugeras' method. Once the calibration process was available, we showed how to calculate the 3D coordinates of a world point given its corresponding pixel coordinates in two images taken by a camera undergoing known motion. To solve this correspondence problem we presented a brief overview of correspondence methods, which focus on supplying the corresponding coordinates of the same point in different images. But these points are not arbitrary points; we called them 'interesting points'. By interesting points we mean points that can provide the best information about the dimensions of the scene objects. For example, the highest point of an object in the scene is an interesting point, since it models the object's height. If we find enough interesting points to characterize the heights of all the objects, then, when we reconstruct these points in the world, we will know how far away and how high each object is. An object will then be considered an obstacle if its distance to the camera is greater than a distance threshold and its height is greater than a height threshold. Defining these thresholds is what we referred to as fixing a decision criterion. Now we can state this decision criterion, which is very simple and configurable: we define a region in the scene, and if some of these interesting points fall inside it, then an obstacle exists.

In this section we describe the steps followed to implement this obstacle detection system. To carry out the experiments shown in the next pages we fixed the camera on a movable rail. The displacement that a robot or car would provide to this system has been simulated by moving the camera along the rail. Neither the illumination of the scene nor the displacement measurements have been specially arranged to make the work of the system easier. Once the camera is installed, the perspective projection matrix P has to be calculated for these scene conditions. After this, we just take images while moving the camera along the rail. We have to point out that the range of movement, of about a few centimeters, is justified by our interest in exposing the system to a hard restriction: suppose we have a slow processor which needs some fractions of a second to calculate the 3D reconstruction between a pair of images (an exaggeration, since we have estimated this time at a few milliseconds), and that our displacement measurement error is so high that it is in the range of centimeters. Despite these hard working conditions, we will see that the system behaves in a positive way. The program steps are shown in Figure 8.

Figure 8: Application steps (calibration; capture image 1 at t; vehicle displacement (dx, dy, dz); capture image 2 at t + t0; find 'the best' features in image 1; find their correspondences in image 2; for each correspondence calculate its (X, Y, Z); apply the decision criteria; if an obstacle is detected, raise the alarm).

The original images used by the application are shown in the experiments, together with the features tracked from one image to the next and the 3D reconstruction of these points. For feature selection we keep the points whose eigenvalue is at least a given percentage of the maximum eigenvalue, which is the first to be reconstructed; the number of points therefore depends on the value of this percentage. We refer to it as eigenvalue ≥ X%. The selected features are highlighted in red in the corresponding images. Finally, the goal of these experiments is to show that object dimensions are detected, which is why we do not need to refer to the decision criterion any further in the next pages.
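
Before moving on to the experiments, the following fragment ties the pieces together into one iteration of the loop in Figure 8. It reuses the triangulate sketch from Section 3.2, and the region limits and helper names are placeholders of our own, not values taken from the report.

```python
import numpy as np

def in_watched_region(X, x_range=(-500.0, 500.0), y_range=(200.0, 2000.0), min_height=50.0):
    """Decision criterion (placeholder limits, in mm): an interesting point signals an
    obstacle when its 3D reconstruction falls inside a configurable region of the scene."""
    x, y, z = X
    return x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1] and z >= min_height

def process_image_pair(P, d, features_t0, features_t1):
    """One iteration of Figure 8: triangulate each tracked correspondence and decide."""
    T = np.eye(4); T[1, 3] = d                      # measured displacement (0, d, 0)
    P1 = P @ T
    alarms = []
    for (u0, v0), (u1, v1) in zip(features_t0, features_t1):
        X = triangulate(P, P1, u0, v0, u1, v1)      # sketch from Section 3.2
        if in_watched_region(X):
            alarms.append(X)
    if alarms:
        print("ALARM: %d interesting points inside the obstacle region" % len(alarms))
    return alarms
```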


6.1 Experiment 1

Image Pair Number: [1]
Eigenvalue: ≥ 0.5% of max
Displacement: dx = +55 [mm]
Resulting number of points tracked: 30
Resulting points in the scene: 11

Figure 9: Real Image at t0

Figure 10: Detected Features at t0

Figure 11: Real Image at t1

Figure 12: Detected Features at t1

Real Image Pair [1] and Detected Features. Experiment [1]


Figure 13: 3D Detected Features Views [centimeters]. Experiment [1]: Eigenvalue ≥ 0.5% max.

In this experiment the points taken as 'good points' have an eigenvalue greater than or equal to 0.5% of the maximum eigenvalue obtained. The larger an eigenvalue is, the higher the probability of a correct correspondence between a pair of image points. In this sense, the reliability of the tracked points can be configured for acceptable operation of the system. We can see in this experiment that many points are tracked for a limit of 0.5%, so that many correspondences are found. The problem is that if this limit is too low, we may track some incorrect points, and the probability of detecting a nonexistent object is therefore higher. In this experiment, if the working scene is restricted to z ≥ 0 then there is no error from detecting a wrong object. This is shown in Figure 13, where a box of about 6 centimeters is detected. The dispersion is due to radial distortion in the camera and to calculation errors. In the next experiment the eigenvalue threshold is set to 5%; it is expected that fewer correspondences will be accepted as good.


6.2 Experiment 2

Image Pair Number: [1]
Eigenvalue: ≥ 5% of max
Displacement: dx = +55 [mm]
Resulting number of points tracked: 8
Resulting points in the scene: 4

Figure 14: Real Image at t0

Figure 15: Detected Features at t0

Figure 16: Real Image at t1

Figure 17: Detected Features at t1

Real Image Pair [1] and Detected Features. Experiment [2]


Figure 18: 3D Detected Features Views [centimeters]. Experiment [2]: Eigenvalue ≥ 5% max.

In this experiment the points taken as 'good points' have an eigenvalue greater than or equal to 5% of the maximum eigenvalue obtained. Now only four points are accepted as 'good points'. The probability of a wrong point is lower, but if the limit is too high then few points are reconstructed and we may not have enough information to decide whether an object exists. We can therefore say that if the eigenvalue limit is too high, no points are considered good and it is impossible to detect an object; however, if the eigenvalue limit is too low, many points are detected and it could also be difficult to decide whether an object exists. Some more experiments are shown for a more complex scene. As a result, we can confirm the idea we have extracted from these experiments.



6.3 Experiment 3

Image Pair Number: [2]
Eigenvalue: ≥ 0.5% of max
Displacement: dx = -55 [mm]
Resulting number of points tracked: 25
Resulting points in the scene: 13

Figure 19: Real Image at t0

Figure 20: Detected Features at t0

Figure 21: Real Image at t1

Figure 22: Detected Features at t1

Real Image Pair [2] and Detected Features. Experiment [3]


Figure 23: 3D Detected Features Views [centimeters]. Experiment [3]: Eigenvalue ≥ 0.5% max.

In this experiment one more object is introduced into the scene. If the eigenvalue limit is 0.5% we can see that the new object is detected. Nevertheless, in experiment 4, when the eigenvalue limit is fixed to 5%, the new object is not detected. We can extract two ideas from this. It goes without saying that the problems in detecting objects become greater when the object colour is very similar to the ground of the scene (note that the new object is black, like the ground). The other conclusion is that the more complex a scene is, the lower the eigenvalue limit needed to detect the objects.


6.4 Experiment 4

Image Pair Number: [2]
Eigenvalue: ≥ 5% of max
Displacement: dx = -55 [mm]
Resulting number of points tracked: 7
Resulting points in the scene: 4

Figure 24: Real Image at t0

Figure 25: Detected Features at t0

Figure 26: Real Image at t1

Figure 27: Detected Features at t1

Real Image Pair [2] and Detected Features. Experiment [4]


Figure 28: 3D Detected Features Views [centimeters]. Experiment [4]: Eigenvalue ≥ 5% max.

We can see that the new object introduced into the scene is not detected because the eigenvalue limit is too high.


6.5 Experiment 5

Image Pair Number: [3]
Eigenvalue: ≥ 1.75% of max
Displacement: dx = +65 [mm]
Resulting number of points tracked: 78
Resulting points in the scene: 62

Figure 29: Real Image at t0

Figure 30: Detected Features at t0

Figure 31: Real Image at t1

Figure 32: Detected Features at t1

Real Image Pair [3] and Detected Features. Experiment [5]


Figure 33: 3D Detected Features Views [centimeters]. Experiment [5]: Eigenvalue ≥ 1.75% max.

In this last experiment the eigenvalue limit has been set to 1.75% in order to detect all the objects in the scene. This shows that the eigenvalue limit depends on the image complexity, and choosing it does not look like a trivial decision in the design of our obstacle detection system. It is as bad to detect too many points (points of low reliability) and decide that an object exists when it really does not, as to not detect enough points and so fail to alert when an obstacle does exist. After analyzing these experiments it becomes clearer that deciding on the eigenvalue limit is not a trivial task, and a single value does not usually work well in all scene conditions. That is why we think a statistical analysis is necessary to decide successfully: it has become a statistical problem. However, this is beyond the scope of this report and is proposed as future work.


6.6 About Errors in Height Estimation

In the section on Faugeras' calibration the 3D reprojection error was calculated, with a mean of 41.44 mm and a standard deviation of 22.68 mm. That is, when we reproject an arbitrary point we may be making an error of about 4 cm on average. If the height of the point is large enough, the relative error made is negligible. To finish this report we have to answer a final question: how much error am I making in the height estimation if I make an error of, for example, 10% in the distance measurement? To shed light on the matter, imagine there is an object in the scene whose height is x meters. If I make an error of 10% in measuring the movement of the robot or vehicle that carries this obstacle detection system, what is the range of object heights obtained? The results for x = 1 m and x = 0.5 m are shown in the next figures.

Figure 34: In red, for dx=145; in green, for dx=145±10% [centimeters].


Figure 35: In red, for dx=100; in green, for dx=100±10% [centimeters].

In the case of the object whose real height is 1 m, the value obtained for a 10% error in the distance measurement is 1 m ± 14.7%. In the case of the 0.5 m object it is 0.5 m ± 20%, which shows that the taller an object is, the smaller the relative error in its height estimate. That is why each particular case has to be studied depending on the working conditions.



Conclusion and Future Work

In this report we have described an implementation of an obstacle detection system based on a computer vision application. All the steps follow the development of the final design and have been analyzed with care. The result is a system which reconstructs in 3D space a variable number of 'interesting points' of the scene; once we have a set of these points, we decide, based on a decision criterion, whether an obstacle exists. This last task may need a more profound analysis and deserves to be studied with special interest, because the success of the system depends heavily on this criterion. The analysis has turned into a statistical problem, and complex mathematical tools will be necessary for the treatment of this information. This statistical treatment is proposed as future work, in order to find an optimum eigenvalue limit for each situation.


References

[1] Andrade-Cetto, J. and Sanfeliu, A., Integration of perceptual grouping and depth. Proc. IAPR, Volume 1, September 2000.
[2] Andrade-Cetto, J., PhD Thesis: Environment Learning for Indoor Robots. Institut de Robòtica i Informàtica Industrial, UPC-CSIC. ISBN 84-688-2339-2.
[3] Andrade-Cetto, J., Technical Report: Camera Calibration. Institut de Robòtica i Informàtica Industrial, UPC-CSIC. IRI-DT-01-02.
[4] Birchfield, S., Computer program: An implementation of the Kanade-Lucas-Tomasi feature tracker. Stanford Vision Laboratory.
[5] Forsyth, D. and Ponce, J., Computer Vision: A Modern Approach. Prentice Hall. ISBN 0-13-085198-1.
[6] Giraudon, G. and Deriche, R., Rapports de Recherche: On corner and vertex detection. RR-1439. Institut National de Recherche en Informatique et en Automatique.
[7] Mitran, M., PhD Thesis: Active Surface Reconstruction from Optical Flow. Department of Electrical and Computer Engineering, McGill University, Montreal.
[8] Moreno, F., Projecte Fi de Carrera: Desenvolupament d'un sistema d'estereovisió per un robot mòbil. Approved 2 March 2001. Universitat Politècnica de Catalunya, UPC.
[9] Owen, R., Web publication: Computer Vision IT412. Department of Computer Science, School of Informatics, University of Edinburgh.
[10] Rahardja, K. and Kosaka, A., Automatic Camera Calibration. RVL Memo 37, Robot Vision Laboratory, Purdue University. March 1995.
[11] Ricolfe, C., Sánchez, A.J. and Simarro, R., Report: Técnicas de calibrado de cámaras. Departamento de Ingeniería de Sistemas y Automática, Universidad Politécnica de Valencia.
[12] Sebastián, J.M., Lecture slides: Detección de Esquinas y Vértices. Departamento de Automática, Ingeniería Electrónica e Informática Industrial, UPM.
[13] Shi, J. and Tomasi, C., Good Features to Track. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 94), Seattle, June 1994.
[14] Tomasi, C. and Kanade, T., Detection and Tracking of Point Features. Technical Report CMU-CS-91-132.
[15] Torras, C., Computer Vision: Theory and Industrial Applications. Springer-Verlag. ISBN 0-387-52036-8.
[16] Ying Wu, Camera Model and Image Formation. ECE510 Computer Vision Notes, Series 2.


Acknowledgements

This report was written during my stay at the Institut de Robòtica i Informàtica Industrial in Barcelona in the summer of 2004, at the age of 21. It was directed by Dr. Josep Amat Girbau, Full Professor at the Universitat Politècnica de Catalunya, Spain. Many thanks to Guillem Alenyà for his indispensable help, and to Juan Andrade for all the information provided.
