IEEE Proc. Computer Vision and Pattern Recognition, Jun. 23-25, 1999, Fort Collins, CO

Detecting and Tracking Moving Objects for Video Surveillance

Isaac Cohen    Gérard Medioni
University of Southern California
Institute for Robotics and Intelligent Systems
Los Angeles, CA 90089-0273
{icohen, medioni}@iris.usc.edu

Abstract

We address the problem of detection and tracking of moving objects in a video stream obtained from a moving airborne platform. The proposed method relies on a graph representation of moving objects, which allows us to derive and maintain a dynamic template of each moving object by enforcing its temporal coherence. This inferred template, along with the graph representation used in our approach, allows us to characterize object trajectories as optimal paths in a graph. The proposed tracker can deal with partial occlusions and stop-and-go motion in very challenging situations. We demonstrate results on a number of different real sequences. We then define an evaluation methodology to quantify our results and show how tracking overcomes detection errors.

1 Introduction

The increasing use of video sensors in surveillance applications, with pan-tilt-zoom capabilities or mounted on moving platforms, has focused researchers' attention on processing arbitrary video streams. Processing a video stream to characterize events of interest relies on the detection, in each frame, of the objects involved, and on the temporal integration of this frame-based information to model simple and complex behaviors. This high-level description of a video stream relies on accurate detection and tracking of the moving objects, and on the relationship of their trajectories to the scene.

In this paper, we address the problem of detecting and tracking moving objects in the context of video surveillance. Most of the techniques used for this problem assume a stationary camera [4, 3] or closed-world representations [8, 6], which rely on a fixed background or on specific knowledge about the type of actions taking place. We deal with a more challenging type of video stream: one obtained from a moving airborne platform. This more general case allows us to evaluate the proposed approach on video streams acquired in real-world video surveillance situations.

We propose an approach which relies on a graph representation of detected moving regions to derive a robust tracker. The detection phase, performed after compensating for the image flow induced by the motion of the observation platform, produces a large number of regions. Indeed, the use of the residual flow field and its normal component, i.e. the normal flow, to locate moving regions
also detects registration errors due to local changes not correctly handled by the stabilization, as well as 3D structures (i.e. parallax). Defining an attributed graph, where each node is a detected region and each edge is a possible match between two regions detected in two different frames, provides an exhaustive representation of all detected moving objects. This graph representation allows us to maintain a dynamic template of each moving object, which is used for tracking. Moreover, the graph is used to characterize object trajectories through an optimal search path along each of the graph's connected components.

The paper is organized as follows: we first describe in Section 2 the detection technique used. The graph representation and the dynamic template inference are described in Sections 3 and 4, respectively. Section 5 presents the method used for deriving object trajectories from the associated graph. Finally, in Section 6 we describe the evaluation technique used for quantifying the results obtained on the set of processed videos.

2 Detection of Moving Objects

Most available techniques for detecting moving objects have been designed for scenes acquired by a stationary camera. These methods
allow segmenting each image into a set of regions representing the moving objects by using a background-differencing algorithm [6, 4]. More recently, [3] proposed a local model of the background using a mixture of K Gaussians, allowing video streams with a time-varying background to be processed. These methods give satisfactory results and can be implemented for real-time processing without dedicated hardware.

The availability of low-cost video sensors with pan-tilt-zoom capabilities, and of video streams acquired by moving platforms, has focused researchers' attention on the detection of moving objects in video streams acquired by a moving platform. In this case, background-differencing techniques cannot be applied directly; they have to rely on a stabilization algorithm to cancel the camera motion. Such a two-step technique, i.e. stabilization then detection, does not perform perfectly, since detection techniques based on background differencing assume a perfect stabilization. Indeed, stabilization algorithms use an affine or perspective model for motion compensation, and the quality of the compensation depends on the observed scene and on the type of acquisition (e.g. pan-tilt-zoom, arbitrary motion). Therefore, the motion compensation is not error free and induces false detections. However, one can use the temporal coherence of the detected regions to increase the accuracy of the moving-object detection [10].

Instead of using this two-step approach, we
propose to integrate the detection into the stabilization algorithm by locating the regions of the image where a residual motion occurs. These regions are detected using the normal component of the optical flow field.

The normal flow is derived from the spatio-temporal gradients of the stabilized image sequence. Each frame of this sequence is obtained by mapping the original frame to the selected reference frame. Indeed, let \tilde{I}_k denote the warping of the image I_k to the reference frame I_0. The mapping function is defined by the following equation:

    \Psi_{k,0} = \Psi_{k,k-1} \circ \Psi_{k-1,0}    (1)

and the stabilized image sequence is defined by \tilde{I}_k(\mathbf{x}) = I_k(\Psi_{k,0}(\mathbf{x})). Estimating the mapping function amounts to estimating the egomotion, based on the camera model which relates 3D points to their projections in the image plane. The approach we use models the image-induced flow instead of the 3D parameters of the general perspective transform [7]. The parameters of the model are estimated by tracking a small set of feature points \{p_j\} in the sequence. Given a reference image I_0 and a target image I_k, image stabilization consists of registering the two images, i.e. computing the geometric transformation \Psi that warps the image I_k so that it aligns with the reference image I_0. The parameters of the geometric transform \Psi are estimated by minimizing the least-squares criterion:

    E(\Psi) = \sum_j \| \Psi(p_j^0) - p_j^k \|^2    (2)

where outliers are detected and removed through an iterative process. We choose an affine model, which approximates the general perspective projection well while having a low numerical complexity. Furthermore, a spatial hierarchy, in the form of a pyramid, is used to track the selected feature points. The pyramid consists of at least three levels, and an iterative affine parameter estimation produces accurate results.

The reference frame and the warped one do not, in general, have the same metric since, in most cases, the mapping function \Psi_{k,0} is not a translation but a true affine transform; this influences the computation of the image gradients used for moving-object detection. This change in metric can
be incorporated into the optical flow equation associated with the stabilized image sequence \tilde{I} in order to detect the moving objects more accurately. Indeed, the optical flow associated with the image sequence \tilde{I} satisfies:

    \nabla \tilde{I} \cdot \mathbf{v} + \frac{\partial \tilde{I}}{\partial t} = 0    (3)

where \mathbf{v} = (u, v) is the optical flow. Expanding the previous equation, we obtain:

    \tilde{I}_x u + \tilde{I}_y v + \tilde{I}_t = 0    (4)

and therefore the normal flow \mathbf{w} is characterized by:

    \mathbf{w} = - \frac{\tilde{I}_t}{\| \nabla \tilde{I} \|^2} \nabla \tilde{I}    (5)

Although \mathbf{w} does not always characterize the image motion, due to the aperture problem, it allows moving points to be detected accurately: the amplitude of \mathbf{w} is large near moving regions and becomes null near stationary regions. Figure 1 illustrates the detection of moving vehicles in a video stream taken from an airborne platform. We encourage the reader to view the movie files available at http://iris.usc.edu/home/iris/icohen/public_html/tracking.htm, which illustrate the detection on the raw video sequence and on the projected mosaic.

3 Graph Representation of Moving Objects

The detection of moving objects in the image sequence gives us a set of regions representing the locations where a motion was detected.

Figure 1: Detection of several vehicles in a video stream acquired by an airborne platform.

The normal component given by equation (5) allows, given a pair of frames, detecting the points of the image where a motion occurs. These points are then aggregated into regions by thresholding the amplitude of the normal flow, and labeled using a 4-connectivity scheme. Each of these connected components represents a region of the image where a motion was detected.

The purpose of detecting moving objects in a video stream is to be able to track these objects
over time and to derive a set of properties from their trajectories, such as their behaviors. Commonly used approaches for tracking are token-based, when a geometric description of the object is available [2], or intensity-based (optical flow, correlation, etc.). Token-based techniques are not appropriate for blob tracking, since a reliable geometric description of the blobs cannot be inferred; on the other hand, intensity-based techniques ignore the geometric description of the blob. Our approach combines both techniques by incorporating both spatial and temporal information in the representation of the moving objects. Such a representation is provided by a graph structure where nodes represent the detected moving regions and edges represent the relationship between two moving regions detected in two separate frames.

Each newly processed frame generates a set of regions corresponding to the detected moving objects. We search for possible similarities between the newly detected objects and the previously detected ones. Establishing such connections can be done through different approaches, such as template matching [5] or correlation [11]. However, in video surveillance little information about the moving object is available, since the observed objects are of various types. Also, objects of small size (humans in airborne imagery) and large changes in object size are frequent, making template-matching approaches unsuitable.

Figure 2: Detected regions and associated graph.

Each pair of frames gives us a set of regions where residual motion was detected (see Figure 2). These regions can be related to the previously detected ones by measuring the gray-level similarity between a region at one time instant and a set of regions, located in its neighborhood, at a subsequent instant. A region may have multiple matches, and the size of this neighborhood is obtained from the object's motion amplitude. In Figure 2 we show the graph representation associated with the detected red blob. Each node is a region represented by an ellipsoid derived from the principal directions of the blob and the associated eigenvalues. Also, a set of attributes is associated with each node, as illustrated in Figure 3. We assign to each edge a cost, which is the likelihood that the two regions correspond to the same object. In our case, the likelihood function is the image gray-level correlation between the pair of regions.

4 Dynamic Template Inference

The graph representation gives an exhaustive description of the regions where a motion was detected, and of the way these regions relate to one another. This description is appropriate for handling situations where a single moving object is detected as a set of small regions. Such a situation happens when, locally, the normal component of the optical flow is null (aperture problem): consequently, instead of detecting one region, we obtain a set of small regions.

Figure 3: Description of the attributes associated with each node of the graph: mean, variance, velocity, frame number, centroid, principal directions, parent ids and similarities, child ids and similarities, and length. Each color represents a moving region.
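The attributed graph described above can be sketched as follows. This is a minimal illustration, assuming Python with NumPy: the class and function names, the fixed search radius, and the patch-cropping shortcut are assumptions rather than the authors' implementation; only the correlation-based edge cost and the parent/child bookkeeping follow the text.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class RegionNode:
    """A detected moving region: one node of the attributed graph."""
    frame: int                    # frame number
    centroid: np.ndarray          # blob centroid (x, y)
    patch: np.ndarray             # gray-level patch around the blob
    parents: dict = field(default_factory=dict)   # node id -> similarity
    children: dict = field(default_factory=dict)  # node id -> similarity

def correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Gray-level correlation between two patches (the edge cost).
    Patches are cropped to a common size for simplicity."""
    h, w = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
    a = a[:h, :w].astype(float).ravel()
    b = b[:h, :w].astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def link_regions(prev_ids, curr_ids, nodes, radius=20.0):
    """Add an edge for each (previous, current) region pair whose centroids
    lie within `radius` pixels; in the paper's setting this radius would be
    derived from the object's motion amplitude."""
    for j in curr_ids:
        for i in prev_ids:
            if np.linalg.norm(nodes[i].centroid - nodes[j].centroid) <= radius:
                s = correlation(nodes[i].patch, nodes[j].patch)
                nodes[i].children[j] = s
                nodes[j].parents[i] = s
```

Running `link_regions` once per frame pair grows the graph; object trajectories then correspond to high-correlation paths within each connected component, as the text describes.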

Usually, clustering techniques are applied to merge the detected blobs in order to recover the region corresponding to the moving object. These image-based techniques [6, 9] rely on the proximity of the blobs in the image and frequently merge regions that belong to separate objects. Among the detected regions, some small regions should be merged into a larger region, while others have a trajectory of their own. In both cases, based on the graph representation, these regions belong to a connected component of the graph. In our approach, we cluster the detected regions in the graph rather than in a single image, as was done in previous work [6, 9]. Indeed, clustering through the graph prevents us from merging regions belonging to objects with distinct trajectories, since the clustering based on image proximity is done within a connected component of the graph.

The robustness of the clustering technique is also improved by maintaining a dynamic template of the moving objects for each connected component, and therefore for each moving object in the scene. Several techniques have been proposed for automatically updating a template description of the moving objects, such as weighted shape descriptions [9] or cumulative motion images [1]. The main drawback of these approaches is that errors in the shape description (i.e. in the boundaries) are propagated, and therefore these techniques are not suitable for a moving camera. We propose an approach based on a median shape template, which is more stable and produces a robust description of the templates. The templates are computed by applying a median filter (after aligning the centroid and the orientation of each blob) over the last five detected frames of the region. The dynamic template allows completing the graph description. In video surveillance

Figure 4: Propagation of the nodes in order to recover the description of undetected objects. On the left we show the detected region at each frame and, on the right, the associated graph, where the red node represents a node inferred from the median shape of the template.
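The median shape template described above can be sketched as follows, assuming binary blob masks. Alignment here uses only the centroid (the orientation alignment mentioned in the text is omitted for brevity), and the canvas size and function names are illustrative assumptions.

```python
import numpy as np

def align_by_centroid(mask: np.ndarray, size: int = 32) -> np.ndarray:
    """Paste a binary blob mask into a fixed-size canvas, centered on its centroid."""
    ys, xs = np.nonzero(mask)
    canvas = np.zeros((size, size), dtype=np.uint8)
    if len(ys) == 0:
        return canvas
    cy, cx = int(ys.mean()), int(xs.mean())
    for y, x in zip(ys, xs):
        yy, xx = y - cy + size // 2, x - cx + size // 2
        if 0 <= yy < size and 0 <= xx < size:
            canvas[yy, xx] = 1
    return canvas

def median_template(masks) -> np.ndarray:
    """Pixel-wise median over the last five aligned detections: a pixel belongs
    to the template iff it is set in a majority of the aligned masks."""
    aligned = np.stack([align_by_centroid(m) for m in masks[-5:]])
    return (np.median(aligned, axis=0) >= 0.5).astype(np.uint8)
```

Because the median discards outlier shapes, a single bad segmentation among the last five detections does not corrupt the template; this is the stability argument made above against cumulative schemes under a moving camera.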
