Motion detection and tracking using belief indicators for an automatic visual-surveillance system

Cina Motamed
Université du Littoral Côte d'Opale, Laboratoire LASL
50 R. F. Buisson Bat. B, 62228 CALAIS, FRANCE
[email protected]

Abstract

A motion detection and tracking algorithm for human and car activity surveillance is presented and evaluated using the PETS'2000 test sequence. The proposed approach uses a temporal fusion strategy, exploiting the history of events in order to improve instantaneous decisions. Normalized indicators updated at each frame summarize the history of specific events. For the motion detection stage, a fast updating algorithm for the background reference is proposed. The control of the updating at each pixel is based on a stability indicator estimated from inter-frame variations. The tracking algorithm uses a deterministic region-based approach. A belief indicator representing the tracking consistency of each object allows defined ambiguities to be solved at the tracking level. A second specific tracking indicator, representing the identity quality of each tracked object, is updated by integrating object interactions. The tracking indicators make it possible to propagate uncertainties to higher levels of the interpretation, and are also used directly in the tracking performance evaluation.

Keywords: video-surveillance, motion detection, tracking, belief indicator, belief updating, uncertainty management

1. Introduction

The development of vision systems carrying out the monitoring of sites is an interesting field of investigation. Indeed, the motivations are multiple and concern various domains, such as the monitoring and surveillance of significant protected sites, and the control and estimation of flows (car parks, airports, ports, and motorways). Because of the fast evolution in the fields of data processing, communications and instrumentation, such applications have become possible.

An automatic visual-surveillance system generally contains several main hierarchical modules: motion detection, the tracking procedure, object classification and finally high-level motion interpretation. The main surveillance applications concern object motion indexing, activity recognition and incident detection.

The majority of real applications use a fixed camera. Vision modules for visual surveillance are optimized in order to contribute to the real-time response of the global system. The performance of such a system is generally context-dependent, and systems may be compared on a specific application with respect to:
* the quality of decisions,
* the trade-off between quality and computing time,
* the robustness with respect to small perturbations.

In fact, visual surveillance is a vision application with a high degree of dependence on the context. In many applications a top-down approach has to be envisaged, by modeling what we want to observe, and the contextual knowledge has to be clearly highlighted so that it can be integrated at each level of the interpretation system [2][3][9]. Tracking is an important task of computer vision, with many applications in surveillance, scene monitoring, navigation, and video database management. In the vision community, four main approaches exist:

- Deformable 2D models represent individually the frontier of each object in the image [1]. This approach is particularly sensitive to its initialization when the position and size of objects are unknown. It also demands isolated objects and continuous variation of the available observations.
- 3D model-based tracking uses a constant or parametric model of the known objects [7]. This top-down approach is particularly efficient for resolving occlusions in 3D space. The difficulties lie essentially in the complexity of model generation and of initializing the recognition step from a single camera.
- Feature-based methods track features such as lines or points, and are particularly efficient when a segmentation of the scene is not easily available, as when the camera is moving [17]. The principal drawback concerns the problem of grouping features to build the object level.
- The region-based approach is widely used in the visual-surveillance field because it brings many advantages. Regions are easily identified using motion information when a fixed background is present [8]. Detected regions contain important information: color, texture, position, orientation, etc. They are efficient for object tracking because they are associated with the global position and area of objects. But, as with other 2D approaches, regions are sensitive to occlusion.

This work is concerned with the detection and tracking of objects in an outdoor car-park environment using a deterministic region-based approach. Many uncertain events are likely to occur when the system has to operate outdoors, in unconstrained scenes containing multiple objects. One important first step is to understand uncertainty in terms of its sources, because an appropriate way of managing uncertainty is generally to focus on its source. The main difficulties concern illumination changes and shadow artifacts at the detection level, and occlusions at the tracking level.
These problems are very harmful for surveillance tasks because they induce target misses, false detections and object identity confusion. It is important for visual-surveillance algorithms to have the capacity to manage efficiently the

uncertainty of their decisions. A common technique for representing uncertainty is to qualify all decisions by attaching a belief indicator representing a confidence factor. Two main principles of uncertainty management exist. The principle of minimum uncertainty is commonly used in estimation theory: the strategy is to discard the less likely conclusions in order to highlight the best ones. On the other hand, the principle of maximum uncertainty is essential for any problem involving secure reasoning: it guarantees that ignorance has been fully recognized and propagated. Our approach is mainly based on the second principle. In fact, for a surveillance system it is crucial to incorporate the notion of "positive security" through an autonomous recognition of its possible inability to decide. For example, if the tracker cannot efficiently maintain the identities of a set of objects over their tracks, it is more cautious for the high-level interpretation to propagate all possible alternatives, in order not to lose real critical events. A major drawback of such a secure approach is that it may increase the false-alarm rate. Physical events observed by visual-surveillance systems are typically continuous, and by accumulating evidence from successive observations, decisions may be improved by focusing on consistent events. This evidence-accumulation approach is also called temporal fusion; we have integrated it into both the motion detection and the tracking modules. In the second section, we present the motion detection process, based on a difference between the current image and a reference image. In the third section we expose the tracking algorithm. All presented algorithms are evaluated on the "test sequence" proposed in the framework of PETS'2000 (IEEE Workshop on Performance Evaluation of Tracking and Surveillance). This test sequence contains some of the typical ambiguities often appearing in video-surveillance applications.
Algorithms are tested at an image resolution of 384x288 pixels. Global tracking results are summarized and discussed in the fourth section.

2. Motion Detection Process

2.1. Introduction

For fixed cameras the standard motion detection approach is to model the stationary background. In this case, the moving objects representing the foreground are easily extracted by a simple difference between the current image and the background. Motion detection, like every low-level processing step, directly affects the upper levels of the interpretation. It becomes particularly sensitive when the system has to operate in ambient lighting conditions. In many applications gray-level images have been used and seem to be sufficient. With the improvement of computational resources, color image processing has become possible and permits, to some extent, an improved quality of detection by increasing the global difference between images. A motion detection algorithm is typically composed of three tasks:

* Reference background generation. The background reference image R can be constructed by updating a model at each new acquisition. Generally a recursive filter is used, which limits image storage. The classical background update is computed for each pixel p as:

R_{k+1}(p) = a·R_k(p) + (1 − a)·I_k(p)    (1)

R_k and I_k represent respectively the background and the current image at frame k. The parameter a controls the updating process. This updating is generally made over the whole image and permits progressive illumination changes to be absorbed. The technique is widely used in highway traffic monitoring, where objects move continuously.

* Combination operator. It computes a distance between the observation and the background reference.

* Detection decision for generating the foreground image. It is obtained by thresholding the distance between the observation and the reference:

if distance(R_k(p), I_k(p)) > η then D_k(p) = 1    (2)
else D_k(p) = 0

D_k is the foreground image highlighting moving regions at frame k. The detection threshold η is adjusted automatically or defined by the user; it is generally tuned above the amplitude of typical noise. Other, more complex threshold

adjustment strategies use the connectivity of detected regions.

2.2. Proposed detection algorithm

The background updating process is an important component of the detection algorithm. Several efficient algorithms taking into account first-order signal statistics have been proposed [12][15]. These updating algorithms generally do not explicitly differentiate real static background objects from moving or temporarily stopped objects. In applications such as car-park surveillance, there are many cases where objects come into the scene and remain. In this situation, detected regions inhibit the detection of new objects at the same image location. Initially, stopped objects belong to the foreground, but over time the updating process includes them as part of the background. This operation is called object integration. The inverse procedure is also useful: a car leaving the car park, for example.

An important drawback of the global continuous updating algorithm is that the integration time cannot be controlled easily. First, it depends directly on the object colors: when an object has several colors, some parts of the object can be integrated more rapidly than others. Second, when an object moves slowly or stops for a short time, and in particular when its color is homogeneous over a large surface, the background may be polluted by it. The process should explicitly take into account the presence of moving objects in order to control the updating efficiently. A direct control can be obtained from the object level, by updating tracked objects once they stop; but the drawback of this object-level feedback is the risk of propagating initial motion-segmentation errors. Karmann and Brandt have proposed a per-pixel background updating algorithm based on a signal-processing approach [6]: each pixel is controlled by a Kalman filter. The algorithm adapts quickly to illumination changes over the background image and allows only a slow adaptation inside the regions affected by moving objects. We have developed a selective reference-updating algorithm which takes object motion into account at the pixel level. Its principal objective is to control temporally the background integration [16]. The strategy consists in enabling the updating of each pixel once a stability of observations is observed. The stability reflects the fact that no motion

has been detected for a fixed delay. In order to obtain a recursive algorithm, the stability indicator accumulates the history of the pixel's behavior. This indicator is based on the inter-frame difference; its result is close to the Motion History proposed by [4] (figure 4). At each inter-frame motion detection, the indicator falls to its minimal value of zero. After each non-detection, the indicator is increased recursively until it reaches the value 1. The indicator's elementary step value is defined with respect to a chosen time delay, and so it depends directly on the working temporal resolution. This time delay, called the integration delay, is tuned from contextual information. The updating is then activated only once the stability indicator reaches the value 1. This prevents updating of the recent parts of the background where objects have induced motion. A stopped object is automatically integrated into the background after the integration delay. If an object of the current background begins to move, some hidden areas are uncovered and considered as detected regions. These regions will have their stability indicator increased, and consequently they will quickly become part of the current background. The main advantages of our approach over Karmann's algorithm are the ease of controlling the integration delay, using only the stability indicator, and its lower computational cost.
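A minimal NumPy sketch of this selective scheme, following the inter-frame stability test of equation (3) and the reference comparison of equation (4) given later in this section; the threshold ω, step θ and array shapes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def selective_update(R, S, I_prev, I_cur, omega=20.0, theta=0.04):
    """One frame of selective background updating driven by a stability indicator.
    R: (H, W, 3) background reference; S: (H, W) stability indicator in [0, 1];
    I_prev, I_cur: previous and current (H, W, 3) float RGB frames."""
    # eq. (3): stability grows by theta while no inter-frame motion is seen,
    # and falls back to 0 as soon as motion is detected at the pixel
    interframe = np.abs(I_cur - I_prev).max(axis=2)
    S = np.where(interframe < omega, np.minimum(S + theta, 1.0), 0.0)
    # eq. (4): foreground where the current frame deviates from the reference
    D = (np.abs(I_cur - R).max(axis=2) > omega).astype(np.uint8)
    # the reference is updated only where the stability indicator has reached 1
    R = np.where((S >= 1.0)[..., None], I_cur, R)
    return R, S, D
```

With θ = 1/(f·T), the background at a pixel is integrated after T seconds of observed stability at f frames/s, which is exactly the integration-delay control described above.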

procedures are applied. First, we use standard morphological erosion and dilation operations to reduce noise in the foreground. Then, regions too small to be of interest are removed; in the algorithm, this size threshold is defined globally for the entire image. We illustrate the background red-channel updating at three specific locations A, B and C (figure 1) between frames 1 and 1400. At position A, a gray car has been parked since frame 628, and no updating is made between frames 590 and 635 (figure 2). At position B three cars have been detected, and location C contains no detection (figure 3).
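The two cleaning steps can be sketched in plain NumPy; the 3x3 structuring element, 4-connectivity and size threshold are illustrative assumptions:

```python
import numpy as np
from collections import deque

def erode(mask):
    """3x3 binary erosion: a pixel survives only if its whole neighbourhood is set."""
    p = np.pad(mask, 1)
    out = np.ones_like(mask, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= p[1+dy:1+dy+mask.shape[0], 1+dx:1+dx+mask.shape[1]].astype(bool)
    return out.astype(np.uint8)

def dilate(mask):
    """3x3 binary dilation: a pixel is set if any neighbour is set."""
    p = np.pad(mask, 1)
    out = np.zeros_like(mask, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= p[1+dy:1+dy+mask.shape[0], 1+dx:1+dx+mask.shape[1]].astype(bool)
    return out.astype(np.uint8)

def remove_small_regions(mask, min_size):
    """Drop 4-connected components smaller than min_size pixels (BFS labelling)."""
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                comp, q = [], deque([(y, x)])
                seen[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_size:
                    for cy, cx in comp:
                        out[cy, cx] = 1
    return out
```

An opening (erosion followed by dilation) removes isolated noise pixels while restoring the shape of larger regions; the size filter then discards residual components below the global threshold.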

Figure 1: Selected position A, B and C in the scene

The stability indicator S is updated by the following algorithm:

if max_{c=R,G,B} | I_c^k(P) − I_c^{k−1}(P) | < ω :
    if S_{k−1}(P) < 1 then S_k(P) = S_{k−1}(P) + θ    (3)
else:
    S_k(P) = 0

ω and θ are respectively the inter-frame detection threshold and the stability elementary step. The detection of moving objects is then made by comparing the three RGB components:

if max_{c=R,G,B} | I_c^k(P) − R_c^{k−1}(P) | > ω then D_k(P) = 1    (4)
else D_k(P) = 0

The threshold ω used for detection is the same as the one used for updating the stability indicator; it is defined from the median absolute deviation of the image noise, computed over a training sequence. The motion detection result generally presents many artifacts. Two kinds of cleaning

Figure 2: Detail of the background updating for the gray car at position A (red channel), between frames 500 and 700

Figure 3: Background updating at positions A, B and C (red channel)

Figure 4: Global image stability indicator and the detection result at frame 150.

3. Tracking

3.1. Introduction

The most critical part of a multi-object tracking algorithm is the data-association step. It has to deal with new objects, short- or long-term disappearance of objects, and occlusions. Approaches to data association fall into two global categories. The first is based on sequential logic, which makes a decision at each frame; its major drawback is that once decisions are made, they become irrevocable. The second is based on deferred logic, which considers a set of observations before each decision [11], the extreme case being batch processing. Deferred decisions are unfortunately too expensive for a real-time visual tracking algorithm with respect to memory and processing resources, and are also functionally unsuited to use in an online surveillance system.

Statistical approaches such as Kalman filtering and its extensions are widely used in tracking and control [14]. Kalman filtering allows the robust, predictive estimation of the state of a dynamic system from observations degraded by stochastic errors. It uses a state-evolution model and an observation model. For multi-object tracking, the two best-known extensions are the JPDAF (Joint Probabilistic Data-Association Filter) and the MHT (Multiple Hypothesis Tracking). Their major difficulty concerns the number of parameters to set optimally, such as, for the Kalman filter, the a priori probabilities of false alarms and missed detections. Heuristic approaches are the second alternative for motion correspondence [13]. These approaches are typically more flexible: they generally have fewer parameters and enable an easy incorporation of additional constraints, such as contextual ones.

3.2. Proposed tracking algorithm

3.2.1. Tracking strategy

Vision-based tracking algorithms use kinematic and also visual constraints for establishing correspondence. In our region-tracking problem, the observations representing detected regions are complex and are not corrupted by random noise only: they are affected by detection errors and by merging and splitting artifacts, which are difficult to model globally. In addition, an evolution model of human motion can be difficult to obtain. For all these reasons, our tracking algorithm is based on a heuristic approach.

The proposed tracking algorithm globally follows the Nearest Neighbor (NN) strategy [14], but in the presence of merging or splitting situations two specific procedures are launched in order to solve the association ambiguity. Each detected region is matched with the best-matching tracked object. In order to take object behavior into account, a first-order prediction of each tracked object is used. For each detected region, four features are computed: center of gravity, bounding box, area size, and color histogram. The matching is based on spatial proximity and visual distance. For each object, in order to reduce the number of candidate regions, a standard spatial validation gate is defined. The gate incorporates a kinematic constraint, which limits the acceleration variation of all objects.

The algorithm explicitly evaluates the quality of each association. This information is summarized by two specific belief indicators: the consistency and identity indicators. These indicators are recursively updated and stored during the tracking step. The consistency indicator of tracked objects permits effective new objects to be validated and their temporary loss to be tolerated. It acts as a robust filter at the object level. The indicator increases when no significant variation in the object features is perceived (bounding-box size, area size, speed, histogram); otherwise, or in extreme situations when the target is lost, the indicator is decreased.
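A simplified sketch of the gated nearest-neighbour association, using spatial proximity only (the paper also weighs visual distances, and the gate radius and greedy matching order here are illustrative assumptions):

```python
import math

def predict(track):
    """First-order prediction of a track's position from its last two positions."""
    (x1, y1), (x0, y0) = track["pos"][-1], track["pos"][-2]
    return (2 * x1 - x0, 2 * y1 - y0)

def associate(tracks, detections, gate=40.0):
    """Greedy nearest-neighbour association inside a circular validation gate.
    Returns matched (track index, detection index) pairs and unmatched detections."""
    pairs, free = [], set(range(len(detections)))
    for ti, track in enumerate(tracks):
        px, py = predict(track)
        best, best_d = None, gate  # only candidates inside the gate are considered
        for di in free:
            dx, dy = detections[di]
            d = math.hypot(dx - px, dy - py)
            if d < best_d:
                best, best_d = di, d
        if best is not None:
            pairs.append((ti, best))
            free.discard(best)
    return pairs, free
```

Unmatched detections left in `free` are the candidates for new-track creation, subject to the consistency-indicator validation described above.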
The detection of significant variations of object features between two consecutive frames uses a weighted sum of the distances of the selected features. The histogram distance is a standard normalized χ² distance, widely used in image-indexing

algorithms. The updating process of the consistency indicator is controlled in terms of a time delay defined by the human expert, as for the stability indicator. For the PETS test sequence, we consider that the consistency indicator reaches its maximum value after a delay of 1 second of stable associations. Track termination is decided, for effective objects, after a fixed delay of consecutive zero values of the consistency indicator; this delay has typically been fixed at 10 seconds. For each tracked object a set of information is stored (table 1).
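A sketch of the χ² histogram distance and the delay-controlled consistency update; the stability threshold and the 25 fps step are illustrative assumptions:

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Normalized chi-square distance between two histograms: 0 for identical
    distributions, 1 for distributions with disjoint support."""
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def update_consistency(c, feature_dist, threshold=0.2, step=1.0 / 25):
    """Raise the consistency indicator on a stable association, lower it otherwise.
    With step = 1/25, the indicator spans [0, 1] in about 1 s at 25 frames/s."""
    if feature_dist < threshold:
        return min(c + step, 1.0)
    return max(c - step, 0.0)
```

The step value plays the same role as the stability indicator's elementary step: it converts the expert-defined time delay into a per-frame increment at the working temporal resolution.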

Table 1: object information
- Object number
- Initial object number (for split objects)
- Date of creation
- Track termination date
- Position(k), Bounding box(k)
- Histograms(k)
- Consistency indicator(k)
- Identity indicator(k)

3.2.2. Merging procedure

The notion of a group of objects is defined temporarily in order to track the global region produced after the merging of initially distinct objects. When several objects have merged their regions, the system generates at each frame an "artificial observation" for each merged object in order to maintain its track inside the global region. This localization of individual objects is based on their appearance model, composed of the bounding box and the color histogram. The appearance model of each object is updated continuously until the merging situation is detected. During merging, an object may be hidden entirely by another; in this situation the artificial observation is brought close to the center of gravity of the global group position. In order to actively control the construction of this artificial observation, the matching score of the object inside the group is used. The matching score, like a correlation, is based on the minimal distance of the appearance model computed over the group region. The interest of a color-histogram approach compared to a correlation approach is its robustness to scale variation and object deformation, at the price of lower selectivity. The artificial observation is computed by a weighted combination of the best appearance position (weight: matching

score) and the global group position (weight: 1 − matching score). This approach of compound observation is close to the PDAF algorithm [14], which combines the influence of multiple candidates in the validation gate. By using the notion of artificial observation, the objects contained in the group can be tracked and updated individually (figure 5). During a merging situation, the consistency indicator is updated by taking into account only the stability of the artificial object's speed.

The global procedure during the merging situation may be relatively time-consuming, so the decision to create a group is robustly validated only once it has been predicted from the proximity of consistently tracked objects. This permits some fleeting false detections to be tolerated efficiently. On the other hand, this strategy cannot initialize a group if one of the merged objects is newly created and has a low consistency indicator.

3.2.3. Splitting procedure

When the algorithm detects split regions associated with a known group, a specific procedure focuses its attention on the identity of the objects. A visual comparison between the regions before merging and after splitting permits the best identity to be assigned to each region. A second indicator, representing the quality of the identity of each object, is introduced. The identity indicator is set to its maximum at each track initialization. The role of this indicator is to complete the tracking information by propagating identity ambiguity through the global visual-surveillance system.

For modeling identity ambiguity, we have used a possibilistic approach. Possibility theory is one of the uncertainty theories, alongside probability and Dempster-Shafer theory [5]. With each proposition, two belief measures called possibility and necessity are associated. This theory, like Dempster-Shafer theory, permits the notion of ignorance to be manipulated naturally through two dual measures.
The possibility measure P(A) represents the degree of ease with which proposition A can be achieved. The value 0 means that the proposition is completely impossible; the value 1 indicates that it is completely possible. The necessity measure N(A) expresses the degree of certainty of A; it is estimated from the degree of impossibility of the complementary proposition Ā. The values of the two measures are defined within the interval [0,1].

N(A) = 1 − P(Ā)    (5)

with P(A) = max_{B∈A} P(B)

Figures 5 and 6 illustrate how the tracking algorithm maintains tracks during merging situations by generating artificial observations.
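The weighted combination used for the artificial observation can be sketched as follows (the function name and score convention are illustrative; in the paper the matching score comes from the appearance-model distance over the group region):

```python
def artificial_observation(best_pos, group_center, match_score):
    """Compound observation for an object inside a merged group: a mix of its
    best appearance-model position (weight: match_score) and the group's
    center of gravity (weight: 1 - match_score)."""
    bx, by = best_pos
    gx, gy = group_center
    return (match_score * bx + (1.0 - match_score) * gx,
            match_score * by + (1.0 - match_score) * gy)
```

When the appearance match is poor (score near 0), the artificial observation collapses to the group's center of gravity, matching the fully-hidden case described above.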

These two measures are well adapted to tasks such as object recognition, object re-identification and event recognition [10]. The possibility measure naturally expresses the notion of compatibility between objects or events, and the necessity measure evaluates the uniqueness of the association. The evaluation of identity ambiguity is obtained through the necessity of each object. First, a possibility measure of the compatibility of object i with each other object j is estimated. This measure uses the object-size and histogram distances between objects, d1(i,j) and d2(i,j):

P(i,j) = min_{k=1,2} ( 1 − d_k(i,j) )    (6)

Then the necessity measure N(i) for each object i is deduced:

N(i) = 1 − max_{j≠i} P(i,j)    (7)

The identity indicator Δ_k^i of object i at frame k is updated using:

Δ_k^i = min( Δ_{k−1}^i , N(i) )    (8)

Figure 5: case 1, car and pedestrian merging situation
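A sketch of equations (6)-(8), assuming the size and histogram distances d1 and d2 have already been normalized to [0, 1]:

```python
def possibility(d_size, d_hist):
    """Eq. (6): compatibility P(i,j) = min_k (1 - d_k(i,j))."""
    return min(1.0 - d_size, 1.0 - d_hist)

def necessity(i, dists):
    """Eq. (7): N(i) = 1 - max_{j != i} P(i,j).
    dists maps an ordered pair (i, j) to its (d_size, d_hist) distances."""
    return 1.0 - max(possibility(*d) for (ii, j), d in dists.items()
                     if ii == i and j != i)

def update_identity(delta_prev, n_i):
    """Eq. (8): the identity indicator can only decrease after a split."""
    return min(delta_prev, n_i)
```

An object that closely resembles another (small distances, high compatibility) gets a low necessity, so its identity indicator drops and the ambiguity is propagated to the higher interpretation levels.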

The identity indicator is updated after a group-splitting situation only. At each variation of the indicator, the algorithm also stores the identity numbers of the other possible objects. Currently, the splitting procedure tolerates groups containing a maximum of three objects. When the region of a sole object splits without being known as a group, the separated detected regions are associated with new objects, which inherit the history of the initial object (previous track positions and indicators). Track termination of the initial object is decided when the objects newly created by the splitting procedure persist favorably, as measured by their consistency indicators. This reduces the influence of detection errors when a real sole object is split by the motion segmentation for a short delay. In the test sequence, two merging-and-splitting situations exist:
* Case 1: a pedestrian and the white car merge at frame 850 and split at frame 880.
* Case 2: two pedestrians merge at frame 1215 and split at frame 1251.

Figure 6: case 2, pedestrians merging situation

4. Tracking results for the test sequence

In this last section we present tracking results over the whole test sequence (figure 7).

Finally, we have studied the robustness of our algorithm with respect to the temporal and spatial working resolutions over the test sequence (figure 8). Three temporal resolutions (1/1: every frame, 1/2: every second frame, 1/8: every eighth frame) and three spatial resolutions (384x288, 192x144 and 96x72) are tested. The global performance is evaluated using a specific criterion representing the presence duration, in seconds, of well-tracked objects, identified by an identity indicator higher than 0.5 and a consistency indicator higher than 0.9. The setting of the algorithm for each spatial resolution concerns principally the minimal size, in pixels, of the detected objects. We can observe that the performance of the algorithm decreases significantly for low-spatial-resolution images. The main difficulty appears at the motion detection stage, which cannot efficiently detect the small regions associated with pedestrians.
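The evaluation criterion can be sketched as follows (the per-frame indicator lists, field names and 25 fps rate are illustrative assumptions):

```python
def well_tracked_seconds(tracks, fps=25.0, id_min=0.5, cons_min=0.9):
    """Total presence duration, in seconds, of frames where an object is
    'well tracked': identity indicator > id_min and consistency > cons_min."""
    frames = 0
    for track in tracks:
        for identity, consistency in zip(track["identity"], track["consistency"]):
            if identity > id_min and consistency > cons_min:
                frames += 1
    return frames / fps
```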

Figure 7: global tracking results with the associated tracking-consistency indicator.

The first pedestrian is at the beginning very distant from the camera, inducing a small detection region, and is also hidden twice by occluding trees along its trajectory. The object associated with this pedestrian is created from frame 630. This pedestrian then walks very slowly and is integrated into the background several times; the tracking consistency indicator is decreased to its minimum at each integration. Between frames 990-1120, the third pedestrian exits slowly from the gray parked car and cannot be detected efficiently: its colors are at the beginning too close to the background. For the test sequence, the identity indicator is changed only for the two merged pedestrians, after their splitting (frame 1251). The two pedestrians have been tracked correctly, but the system has detected a small risk of identity confusion because these objects have relatively similar sizes and histograms: their identity indicators are decreased from the value 1 to 0.7. The tracking algorithm has been implemented on an 850 MHz PC-compatible machine running Windows NT 4.0. Images of the test sequence have been converted to BMP format at a resolution of 384x288 pixels. The minimal measured processing speed on the BMP image sequence is around 4 frames/s, with a maximum of 8 frames/s.

Figure 8: performance evaluation with respect to resolution changes

5. Conclusion

One important feature of the proposed algorithm is the control of motion detection and tracking by specific belief indicators. At the pixel level, the indicator drives the adaptation of the background in order to tolerate illumination changes and background-object integration. At the tracking level, the algorithm takes into account several difficulties: false alarms, region merging and splitting, and target losses. An identity indicator is created in order to qualify efficiently identity confusion during tracking. The tracking indicators constitute complementary information for summarizing tracking uncertainties in a visual-surveillance context. We have also used them as a global performance-evaluation indicator for studying the robustness of the tracking algorithm with respect to the working image-sequence resolution. Our future work concerns the

improvement and optimization of the color-based appearance models, which are useful for re-assigning objects after merging or long-term disappearance.

References

[1] A. Blake, R. Curwen, A. Zisserman. A framework for spatio-temporal control in the tracking of visual contours. Int. J. Comp. Vision 11 (2), 1993, 127-145.
[2] A. Bobick. Movement, activity, action: the role of knowledge in the perception of motion. Proc. Workshop on Knowledge-Based Vision in Man and Machine, London, England, 1997.
[3] F. Bremond, M. Thonnat. A context representation for surveillance systems. In Proc. of the Workshop on Conceptual Descriptions from Images at the European Conference on Computer Vision (ECCV), Cambridge, 1996.
[4] J.W. Davis, A.F. Bobick. The representation and recognition of action using temporal templates. In Proc. Computer Vision and Pattern Recognition, 1997.
[5] D. Dubois, H. Prade. Possibility theory and data fusion in poorly informed environments. Control Eng. Practice, vol. 2 (5), 1994, pages 811-823.
[6] K.P. Karmann, A. Brandt. Moving object recognition using an adaptive background memory. In V. Cappellini (ed.), Time-Varying Image Processing and Moving Object Recognition, Elsevier Publishers B.V., Amsterdam, Netherlands, 1990.
[7] D. Koller, K. Daniilidis, H. Nagel. Model-based object tracking in monocular image sequences of road traffic scenes. Int. J. Comp. Vision 10, 1993, 257-281.
[8] F. Meyer, P. Bouthemy. Region-based tracking using affine motion models in long image sequences. Computer Vision Graphics Image Processing, 60 (2), 1994, 119-149.
[9] C. Motamed, P. Vannoorenberghe. Behavioural knowledge for video surveillance. In Proc. of Int. Workshop "Dynamic Scene Recognition from Sensor Data", Onera, Toulouse, 1997, pp. 1-10.
[10] C. Motamed. Video indexing based on object motion in a visual-surveillance context. Conference RIAO 2000 (Content-Based Multimedia Information Access), Paris, 2000, pp. 586-593.
[11] D.B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24 (6), 1979, 843-854.
[12] C. Ridder, O. Munkelt, H. Kirchner. Adaptive background estimation and foreground detection using Kalman filtering. In International Conference on Recent Advances in Mechatronics, 1995, pages 193-199.
[13] I.K. Sethi, R. Jain. Finding trajectories of feature points in a monocular image sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9 (1), 1987, 56-73.
[14] Y. Bar-Shalom, T.E. Fortmann. Tracking and Data Association. Academic Press, 1988.
[15] C. Stauffer, W. Grimson. Adaptive background mixture models for real-time tracking. In CVPR'99, Fort Collins, CO, 1999, 246-252.
[16] P. Vannoorenberghe, C. Motamed, J.G. Postaire. Updating a reference image for detecting motion in urban scenes (in French). J. Traitement du Signal, vol. 15, no. 2, 1998, pp. 139-148.
[17] Q. Zheng, R. Chellappa. Automatic feature point extraction and tracking in image sequences for arbitrary camera motion. Int. J. Comp. Vision 15, 1995, 31-76.