Consciousness-driven Model for Visual Attention

Consciousness-driven Model for Visual Attention (Invited Paper)

Pierre CAGNAC, Noël DI NOIA, Chao-Hui HUANG, Daniel RACOCEANU, Laurent CHAUDRON

Abstract— In this paper, a consciousness-driven visual attention model is presented. It is based on a hierarchical analysis of the given visual receptive field. The analysis relies on the complexity of the content information in the given visual scene, and the same approach is applied at each level of the hierarchy. In other words, a divide-and-conquer approach is suggested to analyse the content information of the given visual scene. The proposed approach is relatively simple and straightforward, and thus can be easily implemented on a parallel processing architecture.

I. INTRODUCTION

THE study of visual attention has been approached from various angles. It aims to explain how a visual system selects particular targets from the given visual receptive fields. Indeed, most visual systems have limited capacities: they cannot process every signal that falls onto their visual receptive fields. Instead, they rely on visual attention to bring salient details into focus and to filter out background clutter. Visual attention is thus a major field within cognitive psychology, with many studies proposed over the last decades. The theory of visual attention is frequently described as a combination of recognition and selection, whereas many theories of visual attention separate the two processes both in time and in representation. Visual attention can also be directed to different levels of a visual scene: subjects may selectively attend to either the global or the local level of compound objects. Navon found that responses to global targets were faster than responses to local targets, and that global distractors interfered with local target processing but not vice versa [1].

Some researchers suggested that visual attention is composed of two steps. The first step is saliency: some stimuli are intrinsically conspicuous, or "salient", in a visual scene, and as a result they automatically attract attention. Saliency operates very fast and efficiently; the speed of this saliency-based form of visual attention is on the order of 25 to 50 ms per item. If a stimulus is sufficiently salient, it will pop out of a visual scene. Hence, Itti et al. suggested that saliency is computed in a pre-attentive manner across the entire visual field, most probably in terms of hierarchical centre-surround mechanisms [2]. The second step involves a more complicated mechanism that provides selection capability. The expression of this form of visual attention is most probably managed by higher areas of the brain, such as the frontal lobe. Since a more complicated mechanism is involved, the processing takes longer, more than 200 ms [2].

As a result, in research on visual attention it is necessary to discuss the role and position of visual consciousness. The discussion can be based on a well-known hypothesis among neuroscience researchers: that consciousness is generated by the interoperation of various parts of the brain, the so-called neural correlates of consciousness (NCC). The idea of the NCC implies that it is possible to construct machines (e.g., computer systems) that emulate this interoperation as artificial consciousness (AC). Although AC is still beyond existing technologies, some research is starting to reveal the secrets of the NCC. For example, Tononi et al. suggested a consciousness model based on the information integration of complexity [3], [4], [5]. Itti et al. proposed a computational model of visual attention based on the understanding of the neural mechanisms for the control of attention.

In this paper, we present a consciousness-driven model for visual attention. It is a divide-and-conquer approach. It first processes the whole given visual scene and obtains the optimal location of an equilibratory point. This point is considered one of the candidate locations of visual attention. Next, the given scene is divided into four partitions, and the algorithm is then applied on each partition respectively. An example is shown in Fig. 1: the objects or textures, in which the information is contained, are distributed on a 2-d Euclidean space. Fig. 1(a) and Fig. 1(b) represent two typical kinds of distribution, in which the circles and squares represent two different kinds of objects or textures. These textures can be separated by virtual boundaries (the dashed lines), as in Fig. 1(c) and Fig. 1(d). Based on the proposed algorithm, the optimal locations of the equilibratory points are obtained. Fig. 1(e) and Fig. 1(f) show that the equilibratory points produce four regions on the 2-d Euclidean space. In the following sections, further details are presented, followed by some results and the conclusions.

Pierre CAGNAC and Noël DI NOIA are with the French Air Force Academy, France (email: {p.cagnac, noel.dinoia}@hotmail.fr). Chao-Hui HUANG is with the Bioinformatics Institute (BII), Agency for Science, Technology, and Research (A-STAR), Singapore (email: [email protected]). Daniel RACOCEANU is with the Image & Pervasive Access Lab (IPAL), Centre National de la Recherche Scientifique (CNRS), France (email: [email protected]). Laurent CHAUDRON is with the French Aerospace Lab (ONERA), France (email: [email protected]).

Fig. 1. The representations of the information of the relevant objects distributed on a 2-d Euclidean space (for example, an image). (a) and (b) represent two typical kinds of distribution, in which the circles and squares represent two different kinds of objects or textures. These textures can be separated by virtual boundaries (the dashed lines). In (c) and (d), the objects or textures are separated by virtual boundaries. Based on the proposed algorithm, the optimal locations of the equilibratory points are obtained. (e) and (f) show that the equilibratory points produce four regions on the 2-d Euclidean space.

II. METHOD

In this paper, we suggest a consciousness-driven model for visual attention. It is based on a hierarchical architecture. Each given input image is first decomposed into a bio-inspired model of color representation: the visual stimuli are converted from Red, Green, and Blue to Red-Green, Blue-Yellow, and Luminance. Next, for each given visual scene, we compute the optimal location of an equilibratory point. By using this equilibratory point, the given visual scene can be divided into four partitions. The optimization is based on the balance of the complexities of the content information on these four partitions. Third, the same algorithm is applied on each of these four partitions respectively. Each time, the content information at the location of the equilibratory point is evaluated. Eventually, the locations of these equilibratory points are considered important targets, and thus attract attention.

A. Color Representation

There are many arguments comparing the pros and cons of various color spaces. According to Geusebroek et al., the opponent color theory can be applied to computer vision and implemented as the Gaussian color model [6], [7], [8], [9]. In this model, the opponent colours can be obtained by:

u_red = log(r),    u_green = log(g),
u_blue = log(b),   u_yellow = log((r + g)/2).    (1)

Here (u_red, u_green) and (u_blue, u_yellow) are used to describe the opponent colors Red-Green (RG) and Blue-Yellow (BY), together with the Luminance (L):

v_RG = u_red − u_green,
v_BY = u_blue − u_yellow,
v_L = (3/2)(u_red + u_green + u_blue) − 1.    (2)

These color pairs and the luminance information can be used to describe the visual signal of the human visual system [10], [6], [11]. Based on this color information, the human visual system is able to extract features from images. In our method, these feature extraction algorithms include intensity, color and texture perception.
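The conversion in (1)-(2) can be sketched in NumPy as follows. This is only an illustration, not the authors' implementation: the luminance coefficient is read from the garbled source as 3/2, and a small epsilon is added to guard against log(0); both are our assumptions, as are the function names.

```python
import numpy as np

def opponent_channels(rgb, eps=1e-6):
    """Convert an H x W x 3 float RGB image to opponent channels per (1)-(2)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Eq. (1): log-encoded primaries; eps avoids log(0) on dark pixels.
    u_red = np.log(r + eps)
    u_green = np.log(g + eps)
    u_blue = np.log(b + eps)
    u_yellow = np.log(0.5 * (r + g) + eps)
    # Eq. (2): opponent pairs and luminance (3/2 coefficient read from the source).
    v_rg = u_red - u_green
    v_by = u_blue - u_yellow
    v_l = 1.5 * (u_red + u_green + u_blue) - 1.0
    return v_rg, v_by, v_l
```

On an achromatic (grey) input, both opponent channels are zero, as expected from the opponent-color construction.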

B. Information Extraction

In order to compute the complexities of the information on the partitions, the variable X in (8) is necessary. There are various ways to obtain X. A typical method is the Difference of Gaussians (DoG).

1) Difference of Gaussians: In computer vision, the DoG is a grey-scale image enhancement algorithm that involves the subtraction of one blurred version of an original grey-scale image from another, less blurred version of the original. The blurred images are obtained by convolving the original grey-scale image with Gaussian kernels having differing standard deviations. Blurring an image with a Gaussian kernel suppresses only high-frequency spatial information; subtracting one blurred image from the other preserves the spatial information that lies between the ranges of frequencies preserved in the two blurred images. Thus, the DoG is similar to a band-pass filter that discards all but a handful of the spatial frequencies present in the original grey-scale image.

2) Computer Fovea Model: A generalized model for the feature extraction is proposed in [11]. Huang et al. suggested a bio-inspired computer fovea model in order to model the biological phenomena in the retina. The computer fovea model aims to model a simplified version of the full retina system. First, a general assumption of a center/surround receptive field (RF) of the ganglion can be considered as a reference. Some physiological experiments indicated that the RF of the ganglion exhibits a center/surround characteristic. Furthermore, various publications stated that the RF of ganglions can be modeled as follows [12], [10], [13], [14], [15], [16]:

h_G(δ(α)) ≜ ∇²(G_{σ_G}(α)),    (3)

where ∇²(·) denotes a Laplace filter and G_{σ_G}(·) is a Gaussian filter with standard deviation σ_G.
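The DoG described above can be sketched in pure NumPy with a separable blur. This is a minimal sketch under our own assumptions: the function names, the 3σ kernel radius, and the zero-padded borders are our choices, not the authors'.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    """Sampled 1-D Gaussian kernel, normalised to sum to 1."""
    radius = int(3 * sigma) if radius is None else radius
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D grey-scale image (zero-padded borders)."""
    k = gaussian_kernel1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def difference_of_gaussians(img, sigma_narrow, sigma_wide):
    """Band-pass response: the less-blurred image minus the more-blurred one."""
    return gaussian_blur(img, sigma_narrow) - gaussian_blur(img, sigma_wide)
```

A flat region yields a (near-)zero response away from the borders, while an isolated bright point yields a positive centre response, matching the band-pass behaviour described above.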

Fig. 2. The computer fovea model proposed in [11]. (a) The computer fovea model, proposed by Huang et al. [11], including the photoreceptor cells, the horizontal cells, the bipolar cells and the ganglion cells. (b) A simplified version of the computer fovea model.

Fig. 3. An example of image segmentation by using a given location of a point; the given image has been divided into four partitions: A, B, C, and D.

Photoreceptor: The photoreceptors include various cone and rod cells. These cells react to different wavelengths of visible light. Cone cells can be roughly classified into three types: short wavelength (S/blue cell), middle wavelength (M/green cell), and long wavelength (L/red cell). Generally speaking, a bipolar cell collects the spiking from a set of cone cells, forming a diffuse pathway. The use of a Gaussian function to model the diffusion is suggested in various publications [14], [16]. Thus,

h_R(δ(α)) ≜ G_{σ_R}(α),    (4)

where G_{σ_R}(·) represents a Gaussian filter. As mentioned in [14], σ_R represents the standard deviation, with a range from 1.5 to 12 (cell spaces).

Bipolar Cell: Bipolar cells collect the signal from a number of cone cells and transmit the spiking to ganglion cells. Although there are various types of bipolar cells, in this model the simplest one is chosen: the bipolar cell maps to one cone cell and its opponent channel of cone cell, with a bias −b.

Horizontal Cell: Horizontal cells have been considered as a set of cells that contribute to the surround response of bipolar cells. The horizontal cells have been shown to be color opponent in response. They can be modeled as

h_H(α) ≜ (1/b)(1/g)(δ(α) − h_G(α) ⊗ h_R^{-1}(α)).    (5)

Ganglion: In most cases, a ganglion cell collects the signal from only one bipolar cell. Thus, in this model, only a bias g is used to represent the function of a ganglion cell.

C. Optimal Equilibratory Point

Given a set of objects distributed on a 2-d Euclidean space (e.g., an image), a predefined point (x, y) can segment the space into four partitions. Fig. 3 shows an example, in which the space has been divided into four partitions: A, B, C, and D. The optimal location of the equilibratory point is obtained by using the following equation:

(x̂, ŷ) = argmin_{(x,y)} Σ_{P,Q ∈ {A,B,C,D}|(x,y)} (H(P) − H(Q))²,    (6)

where A, B, C, and D are the images produced by the segmentation based on the predefined point (x, y) on the given input image (see Fig. 3), and H(·) represents a measurement function of the complexity of the input image A, B, C, or D. Thus (6) provides an optimal location of the point for the segmentation.

In order to obtain the optimal location of the equilibratory point on the space, we compute the entropies of the information on the partitions A, B, C, and D. The entropy H of a given variable X with possible values {x_1, ..., x_n} is

H(X) = E(I(X)),    (7)

where E is the expected value and I is the information content of X. If p denotes the probability mass function of X, then the entropy can explicitly be written as

H(X) = Σ_{i=1}^{n} p(x_i) I(x_i) = − Σ_{i=1}^{n} p(x_i) log_b p(x_i).    (8)

An example of image segmentation is shown in Fig. 3: by using a given location of a point, the given image has been divided into four partitions, A, B, C, and D. Based on (8) and the obtained optimal location of the equilibratory point, the level of complexity of each partition is similar to that of any other.
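The search in (6)-(8), together with the recursive hierarchy and the threshold criterion C_t of (9) in Section II-D, can be sketched as follows. This is an illustrative brute-force version under our own assumptions: a grey-scale input in [0, 1], a 32-bin histogram estimate of the entropy in (8), a coarse search grid, and helper names of our choosing.

```python
import numpy as np

def entropy(patch, bins=32):
    """Histogram estimate of H(X) in (7)-(8) for a grey-scale patch in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def quadrants(img, x, y):
    """Partitions A, B, C, D induced by the point (x, y), as in Fig. 3."""
    return img[:y, :x], img[:y, x:], img[y:, :x], img[y:, x:]

def equilibratory_point(img, step=4, margin=4):
    """Eq. (6): brute-force search for the point balancing the partition entropies."""
    h, w = img.shape
    best, best_cost = None, np.inf
    for y in range(margin, h - margin, step):
        for x in range(margin, w - margin, step):
            hs = [entropy(q) for q in quadrants(img, x, y)]
            # Sum of squared pairwise entropy differences over A, B, C, D.
            cost = sum((a - b) ** 2 for a in hs for b in hs)
            if cost < best_cost:
                best, best_cost = (x, y), cost
    return best

def attention_points(img, levels, t=0.5, ox=0, oy=0):
    """Section II-D: recurse on the four partitions; a region is expanded only
    if it passes the threshold criterion C_t of (9), i.e. H(X) > t."""
    if levels == 0 or min(img.shape) < 16 or entropy(img) <= t:
        return []
    x, y = equilibratory_point(img)
    points = [(ox + x, oy + y)]  # candidate location in global coordinates
    a, b, c, d = quadrants(img, x, y)
    points += attention_points(a, levels - 1, t, ox, oy)
    points += attention_points(b, levels - 1, t, ox + x, oy)
    points += attention_points(c, levels - 1, t, ox, oy + y)
    points += attention_points(d, levels - 1, t, ox + x, oy + y)
    return points
```

Because each level only touches its own four partitions, the recursion maps naturally onto the parallel architectures mentioned in the conclusions.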

D. Hierarchical Architecture

The hierarchical architecture is shown in Fig. 4, where Fig. 4(a) is the first level; the black dot represents the optimal equilibratory point of the first level. Fig. 4(b) is the second level: based on the optimal equilibratory point of the first level, the given visual scene is divided into four partitions, and these four partitions are processed by the same procedure. As a result, four optimal equilibratory points are obtained. Fig. 4(c) represents the third level; subsequently, the optimal equilibratory points of this level are obtained by the same procedure. Eventually, the locations of these optimal equilibratory points are validated based on predefined criteria, for example, a threshold t on the amount of content information:

C_t(X) = 1 if H(X) > t, and C_t(X) = 0 otherwise.    (9)

Fig. 4. The example of the hierarchical architecture.

III. RESULTS AND DISCUSSIONS

A result of the proposed approach is shown in Fig. 5, where Fig. 5(a) is the given input, Fig. 5(b) represents the measurement of the amount of content information, and Fig. 5(c) is the final result, in which the grey crosses are the candidate locations for visual attention and the red circles are the selected locations for visual attention.

Fig. 5. The result of the proposed approach.

IV. CONCLUSIONS

In this paper, a consciousness-driven visual attention model is presented. It is based on a hierarchical analysis of the given visual receptive field. The analysis relies on the complexity of the content information in the given visual scene, and the same approach is applied at each level of the hierarchy. In other words, a divide-and-conquer approach is suggested to analyse the content information of the given visual scene. The proposed approach is relatively simple and straightforward, and thus can be easily implemented on a parallel processing architecture, such as a graphics processing unit (GPU) in a modern personal computer, or a field-programmable gate array (FPGA) for an embedded system.

REFERENCES

[1] D. Navon, "Forest before trees: The precedence of global features in visual perception," Cognitive Psychology, vol. 9, pp. 353–383, 1977.
[2] L. Itti and C. Koch, "Computational modelling of visual attention," Nature Reviews Neuroscience, vol. 2, pp. 194–203, 2001.
[3] G. Tononi, G. M. Edelman, and O. Sporns, "Complexity and coherency: integrating information in the brain," Trends in Cognitive Sciences, vol. 2, no. 12, pp. 474–484, 1998.
[4] G. Tononi and O. Sporns, "Measuring information integration," BMC Neuroscience, vol. 4, no. 31, pp. 1–20, 2003.
[5] G. Tononi, "An information integration theory of consciousness," BMC Neuroscience, vol. 5, no. 42, 2004.
[6] J.-M. Geusebroek, R. van den Boomgaard, A. W. M. Smeulders, and H. Geerts, "Color invariance," IEEE Trans. PAMI, vol. 23, pp. 1338–1350, 2001.
[7] J.-M. Geusebroek, R. van den Boomgaard, A. W. M. Smeulders, and T. Gevers, "Color constancy from physical principles," Pattern Recognition Letters, special issue on colour image processing and analysis, vol. 24, no. 11, pp. 1653–1662, 2003.
[8] M. A. Hoang, J.-M. Geusebroek, and A. W. M. Smeulders, "Color texture measurement and segmentation," Signal Processing, special issue on content-based image and video retrieval, vol. 85, no. 2, pp. 265–275, 2005.
[9] H. Müller, N. Michoux, D. Bandon, and A. Geissbuhler, "A review of content-based image retrieval systems in medical applications: clinical benefits and future directions," International Journal of Medical Informatics, vol. 73, no. 1, pp. 1–23, 2004.
[10] D. Hubel, "Eye, brain, and vision," Scientific American, 1988.
[11] C. H. Huang and C. T. Lin, "Bio-inspired computer fovea model based on hexagonal-type cellular neural network," IEEE Trans. Circuits Syst. I, vol. 54, no. 1, pp. 35–47, 2007.
[12] D. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex," J. Physiol. (Lond.), vol. 160, pp. 106–154, 1962.
[13] S. Shah and M. D. Levine, "The primate retina: a biological summary," Ph.D. dissertation, 1992.
[14] S. Shah and M. D. Levine, "Visual information processing in primate cone pathways – part I: a model," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 26, no. 2, pp. 259–274, 1996.
[15] S. Shah and M. D. Levine, "Visual information processing in primate cone pathways – part II: experiments," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 26, no. 2, pp. 275–289, 1996.
[16] J. Thiem and G. Hartmann, "Biology-inspired design of digital Gabor filters upon a hexagonal sampling scheme," in 15th International Conference on Pattern Recognition (ICPR'00), vol. 3, 2000, pp. 445–448.