IMPLEMENTATION OF RANKING FILTERS ON GENERAL PURPOSE AND RECONFIGURABLE ARCHITECTURE BASED ON HIGH DENSITY FPGA DEVICES

Dragomir Milojevic
Service des Systèmes Logiques et Numériques, Université Libre de Bruxelles
CP165/57, 50, Av. F. Roosevelt, B-1050 Bruxelles, Belgium
email: [email protected]

ABSTRACT

In this paper we present the implementation of ranking filters on two different computer architectures. First, we consider a general purpose computer based on the Intel Pentium 4 microprocessor and show that, when the SSE2 extension is used, the throughput varies from 20 to 200 MB/sec, depending on the neighborhood size and rank value. Second, we consider a reconfigurable architecture using hundreds of processing elements running bit-serial algorithms, implemented in a high density FPGA device. The throughput of such a system varies from 900 to 2600 MB/sec, which is 13 to 40 times faster than the considered general purpose architecture.

1. INTRODUCTION

Ranking filters are frequently used in image processing for impulse noise removal or morphological transforms. Applying a ranking filter consists in computing the ordered neighborhood of each pixel of the input image. The pixel of the resulting image is obtained by taking the value indicated by the rank r from the ordered neighborhood. These filters, often applied at the beginning of the image processing chain, are known to be time consuming because the neighborhood of each pixel has to be completely or partially sorted. Neighborhood size varies between applications, from 8 surrounding pixels (3 × 3) to 24 and more (5 × 5, 7 × 7).

Today, image processing is most often used in real-time applications, such as video, where one can expect data rates of about 30 MB/sec. However, other applications, such as quality control, can easily reach rates of up to 250 MB/sec when high speed acquisition cameras [1] are used. For such applications, classic sorting algorithms implemented on general purpose computers will not meet the above specifications.

In order to accelerate this particular image processing task, different dedicated computing platforms have been proposed, based on universal parallel and dedicated architectures. We can mention coarse grain approaches based on array architectures [2], [3], [4] and sorting networks [5], [6], [7], or more fine grained solutions such as stack filters [8] or bit-serial implementations [9], [10], [11]. However, the proposed solutions are often limited to one particular type of filter (mostly median) and one neighborhood size (3 × 3 pixels), and exhibit a very limited degree of parallelism.

In this paper we first present the implementation of ranking filters on a general purpose architecture, a personal computer based on the Intel Pentium 4 microprocessor, using the SIMD extensions of the standard architecture (Section 2). Then we introduce a bit-serial implementation of these filters on a high density FPGA circuit (Section 3), and we compare the obtained performance with that of the general purpose architecture (Section 4).

0-7803-9362-7/05/$20.00 ©2005 IEEE
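As a point of reference for the filter defined in the introduction, the following is a minimal Python sketch of a ranking filter that fully sorts each neighborhood. It is only a functional baseline, not one of the implementations evaluated in this paper. The function name `rank_filter` and the 0-based rank convention (0 for the minimum, Nx × Ny − 1 for the maximum) are assumptions made for this illustration.

```python
def rank_filter(image, nx, ny, r):
    """Reference ranking filter: sort each Nx x Ny neighborhood, take rank r.

    Assumption: r is a 0-based index into the neighborhood sorted in
    ascending order (0 = minimum, nx * ny - 1 = maximum). Only pixels whose
    neighborhood lies fully inside the image are produced.
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(height - ny + 1):
        row = []
        for x in range(width - nx + 1):
            neighborhood = sorted(
                image[y + j][x + i] for j in range(ny) for i in range(nx)
            )
            row.append(neighborhood[r])
        out.append(row)
    return out


image = [
    [10, 12, 11, 13],
    [12, 255, 10, 11],  # 255 is an impulse ("salt") pixel
    [11, 13, 12, 10],
]
median = rank_filter(image, 3, 3, 4)   # [[12, 12]]: the impulse is removed
minimum = rank_filter(image, 3, 3, 0)  # [[10, 10]]
maximum = rank_filter(image, 3, 3, 8)  # [[255, 255]]
```

The full sort of every neighborhood is what makes this baseline slow, which motivates the SIMD and bit-serial approaches of the following sections.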

2. GENERAL PURPOSE ARCHITECTURE

We consider a typical example of a general purpose architecture (GPA): a personal computer based on the Intel Pentium 4 microprocessor with MMX, SSE and SSE2 SIMD extensions. Note that similar extensions are also available on other commercial processors: Altivec on IBM, Max2 on HP, MDMX on MIPS, to name a few. With the Pentium 4 SSE2 extension, two independent floating point units can be used for dedicated vector operations on eight 128-bit registers (xmm0-xmm7). Such an architecture has an obvious advantage for image processing, because pixel data is most often coded on one byte, so instructions can be applied to windows of pixels rather than to a few pixels only.

The image to be processed is sliced into windows of Wx × Wy pixels. For each pixel we consider the neighborhood N of NN = Nx × Ny pixels. The neighborhood is most often square: 3 × 3, 5 × 5 or 7 × 7 pixels. The windows overlap horizontally and vertically by Nx − 1 and Ny − 1 pixels respectively, so that all pixels within the image can be processed (border effect). From now on we will consider that the height of the window is the same as the height of the neighborhood.

Fig. 1. Local Min/Max computations (a) and sorting network for the 3 × 3 median filter (b), SIMD implementation. [Figure not reproduced. Panel a shows register xmm0 and its zero-filled copies xmm0>>1 and xmm0>>2 over a window of width Wx and height Ny, producing Wx − 2 output pixels. Panel b shows a sorting network on the elements N[0] to N[8].]

Table 1. Performance of the GPA: Intel IPL library (D1), proposed implementation (D2) and the speedup (D2/D1).

Rank      N      D1 [MPx/sec]   D2 [MPx/sec]   D2/D1
Min/Max   3×3    24             200            8
Min/Max   5×5    18             143            8
Min/Max   7×7    11             100            9
Median    3×3    6              125            21
Median    5×5    4              42             10
Median    7×7    3              22             7

3. RECONFIGURABLE ARCHITECTURE

Since the computations are done in bit-serial mode, we consider b bit planes (b is the number of bits used to code the pixel colour), each constructed from all the bits of the same weight within the window (each bit plane has Wx × Wy bits). The bit planes are processed one after another, starting from the bit plane corresponding to the most significant bit (MSB) of the pixels. First we describe the generalized bit-serial algorithm for any rank and any neighborhood size, introduced in [13]. Then we briefly describe another bit-serial algorithm that can be used for an efficient implementation of the filter when the value of r is exactly 0 or NN (respectively minimum and maximum, if we consider that the neighborhood is completely sorted in ascending order).

For every bit plane, starting from the MSB plane, we count the number of '1' (n1) and the number of '0' (n0) in the neighborhood. Based on the result of the previous bit plane we compute:

For the Minimum or Maximum filter we use a very simple algorithm. First we find the local minima (or maxima) of all window columns and store this intermediate result in one temporary xmm register. The content of this register is copied Ny − 1 times, and these Ny − 1 copies are then shifted to the left (the first copy is shifted once, the second copy twice, and so on). Computing the local minima (maxima) per column over these Ny copies produces the filtered window. Note that only 16 − (Nx − 1) pixels are computed per loaded window, because of the border effect. Fig. 1a illustrates the necessary shifts of the temporary xmm0 register that holds the local minima (maxima) and how they are used to produce the final result for the 3 × 3 neighborhood size.

For an arbitrary rank, previously proposed sorting networks [5], [6], [7] can be mapped to SSE2. Fig. 1b shows a sorting network used for the 3 × 3 median filter; the arrow indicates the maximum value of a comparison between two vector elements. For a small neighborhood size (3 × 3 pixels) the number of xmm registers is sufficient, so no extra accesses to the main memory are needed. With bigger neighborhood sizes this is no longer true, so frequent memory accesses are inevitable in order to store intermediate results. This obviously degrades performance.
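The Minimum filter described above can be sketched in Python by emulating a 16-lane register with a list. This is an illustration of the data flow only, not the assembly implementation measured in this paper. The helper names (`lanewise_min`, `shift_left`, `min_filter_window`) are hypothetical, and the sketch assumes that a left shift moves pixels towards lane 0 with zero fill, and that the neighborhood is square so that Nx − 1 shifted copies are needed.

```python
LANES = 16  # one 128-bit xmm register holds 16 one-byte pixels


def lanewise_min(a, b):
    """Emulates a packed unsigned-byte minimum on two 16-lane registers."""
    return [min(x, y) for x, y in zip(a, b)]


def shift_left(reg, count):
    """Moves every lane `count` positions towards lane 0, zero-filling the end."""
    return reg[count:] + [0] * count


def min_filter_window(rows, nx):
    """Minimum filter on one window of len(rows) x 16 pixels.

    The window height equals the neighborhood height Ny = len(rows).
    Returns the 16 - (nx - 1) valid output pixels.
    """
    # Step 1: local minima of every window column (vertical pass).
    column_min = rows[0]
    for row in rows[1:]:
        column_min = lanewise_min(column_min, row)
    # Step 2: combine the register with its shifted copies (horizontal pass).
    result = column_min
    for k in range(1, nx):
        result = lanewise_min(result, shift_left(column_min, k))
    # The last nx - 1 lanes mix in zero fill and are discarded (border effect).
    return result[: LANES - (nx - 1)]


rows = [
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 7, 8],
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    [1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
]
filtered = min_filter_window(rows, nx=3)  # 14 valid pixels for a 3 x 3 filter
```

The Maximum filter follows by replacing the lane-wise minimum with a lane-wise maximum. The point of the two-pass structure is that every operation acts on all 16 lanes at once and contains no data-dependent branch.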

$$n = n + n_1 \tag{1a}$$

$$n = n - n_0 \tag{1b}$$

Note that initially n = 0 and that we count the number of '1', so for the MSB plane we have n = n1. The value of n is then compared with the rank r, and the result of this comparison decides whether the output bit for this bit plane is set to '0' or '1'. If n ≥ r, the output bit is set to '1' and every pixel in the neighborhood that has '0' for this weight is discarded from the following search. When the next bit plane is processed, the value of n is computed using Eq. (1b): the pixels still under consideration that have '0' in that plane no longer count. If n < r, the output bit is set to '0' and all the pixels that have '1' for this weight are excluded from the computation of the next bit plane. For the next bit plane, the value of n is computed using Eq. (1a): the excluded pixels remain counted in n, and the remaining pixels that have '1' in that plane are added. The exclusion of pixels in the neighborhood is done by setting the corresponding bits of a mask to '0'. Note that all the bits of the mask are initially set to '1'. The complete neighborhood is then masked (logical AND

The above algorithms were implemented in assembly language and optimized using the Intel VTune Performance Analyzer V5.0 tool. Different memory access and instruction ordering schemes were investigated to avoid cache misses, pipeline stalls and/or operand collisions [12]. The performance of the proposed algorithms is compared with IPL (Intel Image Processing Library) V2.1 in Table 1, for two extreme values of the rank (Minimum/Maximum and median) and for different neighborhood sizes. The column marked D1 indicates the throughput of the IPL routines and the column marked D2 the throughput of our implementation, both measured on a 1.5 GHz Pentium 4. Finally, the third column indicates the speedup.
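To illustrate how a sorting network maps to vector operations, here is a minimal Python sketch of a lane-wise 3 × 3 median. It uses a well-known 19-step median-of-9 exchange sequence, which is an assumption: it is not necessarily the exact network drawn in Fig. 1b, and the code is not the assembly implementation measured in Table 1. All names (`MEDIAN9_NETWORK`, `median9_lanewise`, `median3x3_window`) are hypothetical.

```python
# Compare-exchange pairs of a well-known 19-step median-of-9 network.
# Assumption: this is not necessarily the exact network drawn in Fig. 1b.
MEDIAN9_NETWORK = [
    (1, 2), (4, 5), (7, 8), (0, 1), (3, 4), (6, 7), (1, 2), (4, 5), (7, 8),
    (0, 3), (5, 8), (4, 7), (3, 6), (1, 4), (2, 5), (4, 7), (4, 2), (6, 4),
    (4, 2),
]


def median9_lanewise(vectors):
    """Median of 9 equally long vectors, computed independently per lane.

    Each compare-exchange leaves the lane-wise minimum in the first vector
    of the pair and the lane-wise maximum in the second one, so no
    data-dependent branch is needed.
    """
    v = [list(vec) for vec in vectors]
    for lo, hi in MEDIAN9_NETWORK:
        smaller = [min(a, b) for a, b in zip(v[lo], v[hi])]
        larger = [max(a, b) for a, b in zip(v[lo], v[hi])]
        v[lo], v[hi] = smaller, larger
    return v[4]  # the median ends up in the middle vector


def median3x3_window(rows):
    """3 x 3 median of a 3-row window; returns len(row) - 2 valid pixels."""
    width = len(rows[0]) - 2
    # The 9 neighborhood vectors N[0]..N[8] are the 3 rows at 3 offsets.
    vectors = [row[dx:dx + width] for row in rows for dx in range(3)]
    return median9_lanewise(vectors)


rows = [[(37 * i + 101 * j + 13 * i * j * j) % 256 for i in range(16)]
        for j in range(3)]
medians = median3x3_window(rows)  # 14 valid pixels for a 16-pixel window
```

Only the median is guaranteed to be in place after the 19 steps; the other eight vectors are not fully sorted, which is why such a network needs fewer exchanges than a complete sort.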

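[Fig. 2 block diagram not reproduced. Labelled blocks: DataIn, line buffer (LB), source memory (SM) of width Wx, neighborhoods N1, N2, N3, ..., NWx−(Nx−1) of height Ny, and processing elements PE1, PE2, PE3, ..., PEWx−(Nx−1) forming the MPE, DataOut.]

Table 2. Implementation results of 32 computational modules in Xilinx XC2V8000 FPGA, for the Min/Max and Median ranks.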

Fig. 2. Reconfigurable bit-serial architecture: one 3 × 3 computational module.

between mask and the bit plane) and used as input for the processing element.

When the rank is exactly 0 or NN, another, much simpler algorithm can be used instead of the one described above. If all the bits in the neighborhood have the same value, the resulting bit has that value and all the pixels within the neighborhood remain active for the computation of the next bit plane. If the bits in the bit plane differ, the pixels that have '0' for this weight are excluded (the exclusion is once again performed by masking the corresponding bits at the input) because:

$$1 \cdot 2^{b-1} > \sum_{i=0}^{b-2} 1 \cdot 2^{i} \tag{2}$$

The result for this bit plane is set to '1'. The minimum of the neighborhood can be found in a similar fashion. The only difference is that the input bit plane is inverted this time, since for two words A and B coded with b bits we have:

$$\sum_{i=0}^{b-1} 2^{i} - A > \sum_{i=0}^{b-1} 2^{i} - B \;\Longrightarrow\; A < B$$
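The two bit-serial algorithms of this section can be sketched in Python for a single neighborhood. This is a software model under stated assumptions, not the VHDL design. The rank r is assumed to be counted from the largest value (r = 1 gives the maximum, r equal to the neighborhood size gives the minimum), which is what the test n ≥ r implies. Eq. (1a) is applied for the MSB plane and after an output bit of '0', Eq. (1b) after an output bit of '1', and n0 and n1 are counted over the pixels that are still unmasked. The function names are hypothetical.

```python
def bit_serial_rank(pixels, r, b=8):
    """Generalized bit-serial rank selection over one neighborhood.

    Assumption: r is counted from the largest value (r = 1 gives the
    maximum, r = len(pixels) the minimum).
    """
    mask = [1] * len(pixels)  # all pixels start as candidates
    n = 0                     # pixels >= the value tested so far
    previous_bit = 0          # makes the MSB plane use Eq. (1a)
    result = 0
    for weight in range(b - 1, -1, -1):  # MSB plane first
        plane = [(p >> weight) & 1 for p in pixels]
        n1 = sum(m & bit for m, bit in zip(mask, plane))
        n0 = sum(mask) - n1
        if previous_bit == 0:
            n = n + n1  # Eq. (1a)
        else:
            n = n - n0  # Eq. (1b)
        out_bit = 1 if n >= r else 0
        # Candidates whose bit differs from the output bit are masked out.
        mask = [m & (1 if bit == out_bit else 0) for m, bit in zip(mask, plane)]
        result = (result << 1) | out_bit
        previous_bit = out_bit
    return result


def bit_serial_max(pixels, b=8):
    """Special case for the maximum: no counting, only an all-equal test."""
    mask = [1] * len(pixels)
    result = 0
    for weight in range(b - 1, -1, -1):
        active = [(p >> weight) & 1 for p, m in zip(pixels, mask) if m]
        if all(bit == active[0] for bit in active):
            out_bit = active[0]  # every candidate stays active
        else:
            out_bit = 1          # Eq. (2): a '1' at this weight wins
            mask = [m & ((p >> weight) & 1) for p, m in zip(pixels, mask)]
        result = (result << 1) | out_bit
    return result


def bit_serial_min(pixels, b=8):
    """Minimum obtained as the maximum of the inverted bit planes."""
    all_ones = (1 << b) - 1
    return all_ones - bit_serial_max([all_ones - p for p in pixels], b)


neighborhood = [0, 255, 17, 17, 200, 64, 3, 128, 99]
median = bit_serial_rank(neighborhood, r=5)  # fifth largest of nine values
```

Both routines need exactly b passes regardless of the neighborhood size, one per bit plane, which is what makes them attractive for a large array of simple processing elements.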

Rank      N      Wx    NPE    Q              FClk [MHz]
Min/Max   3×3    20    576    21318 (45 %)   115
Min/Max   5×5    16    384    27935 (59 %)   115
Min/Max   7×7    14    256    32292 (69 %)   92
Median    3×3    20    576    27209 (58 %)   83
Median    5×5    16    384    36013 (77 %)   58
Median    7×7    14    256    37638 (80 %)   45

in parallel (the first bit plane to be processed is the MSB plane). This memory is composed of Wx × Ny shift registers. Pixels at the input of the system are first stored in a buffer that holds one line of the window (marked LB in Fig. 2). When the line buffer is full, its content is copied to the source memory. A line of Wx − (Nx − 1) identical processing elements (MPE) working in bit-serial mode takes the complete bit plane as input. After the processing of each bit plane, the results are written into the destination memory (DM). This memory also formats the data for output, from bit-serial mode to conventional word mode, which is more suitable for transfer. It is composed of Wx − (Nx − 1) shift registers with parallel (word) output.

Input to the source memory, processing of the different bit planes and writing of the results are executed in an optimized pipeline in order to maximize the throughput. Every pixel that enters the source memory results in one processed pixel at the output (once the pipeline is full). Because of the boundary problem, only Wx − (Nx − 1) pixels of the output data are valid and should be stored as the final result for this window. Note that the two local memories store only window data. The image to be processed and the processed image are stored in two independent memory banks.

The described module can be easily mapped onto an existing hardware platform for reconfigurable computing, such as the AlphaData ADP-WRC II board [14], [15], equipped with one Xilinx XC2V8000 high density FPGA device and two independent 256-bit wide memory banks of up to 1 GB capacity each. Note that such an FPGA can implement at most 32 computational modules, since the data bus is 256 bits wide. The complete computational module (processing units, source and destination memory) was described in VHDL, and 32 modules were implemented in the Xilinx XC2V8000 high density FPGA device.
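The data formatting performed by the two local memories can be modelled in a few lines of Python. This is a behavioural sketch only, not the shift-register design itself: `to_bit_planes` plays the role of the source memory, `from_bit_planes` that of the destination memory, and `processing_elements` counts the PEs. All three names are hypothetical, and the limit of 32 modules is derived here under the assumption that each module consumes one 8-bit pixel of the 256-bit bus.

```python
def to_bit_planes(pixels, b=8):
    """Source-memory view: b bit planes, MSB plane first."""
    return [[(p >> weight) & 1 for p in pixels]
            for weight in range(b - 1, -1, -1)]


def from_bit_planes(planes):
    """Destination-memory view: shift each incoming bit into a word."""
    words = [0] * len(planes[0])
    for plane in planes:  # planes arrive MSB first
        words = [(w << 1) | bit for w, bit in zip(words, plane)]
    return words


def processing_elements(wx, nx, modules=32):
    """Total PEs: Wx - (Nx - 1) per module times the number of modules."""
    return modules * (wx - (nx - 1))


BUS_WIDTH_BITS = 256
BITS_PER_PIXEL = 8  # assumption: one byte per pixel, one pixel per module
modules = BUS_WIDTH_BITS // BITS_PER_PIXEL  # 32

line = [0, 1, 128, 255, 37]
planes = to_bit_planes(line)
restored = from_bit_planes(planes)          # equals `line` again
npe_3x3 = processing_elements(wx=20, nx=3)  # 576
```

With the window widths of 20, 16 and 14 pixels, the same count gives 576, 384 and 256 processing elements for the 3 × 3, 5 × 5 and 7 × 7 neighborhoods.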
We consider Minimum, Maximum and Median filters for different neighborhood sizes. Table 2 indicates: the horizontal size of the processing window (Wx ), total number of processing elements (N P E ), the number of occupied slices Q for the given FPGA device (the percentage of all available resources is given between brack-

$$\forall b \in \mathbb{N}^{+} \tag{3}$$

and the opposite is true.

We define one computational module as a system composed of one source and one destination memory (further referred to as local memory) and the associated matrix of reconfigurable processing elements (Fig. 2). The source memory (marked SM in Fig. 2) holds the complete window that has to be processed and formats this data in such a manner that the processing elements can access all the bits of a plane

mapped to a grid of hundreds of identical processing units that can be used to compute any ranking filter with typical neighborhood sizes. The throughput of such a system varies from 900 to 2600 MB/sec and can easily meet the requirements of real-time processing, even for high resolution and/or high speed image acquisition applications.

Table 3. Performance comparison of the GPA and reconfigurable architecture: throughput and speedup.

Rank      N      D2 [MPx/sec]   D3 [MPx/sec]   D3/D2
Min/Max   3×3    200            2650           13
Min/Max   5×5    154            2355           16
Min/Max   7×7    100            1649           16
Median    3×3    125            1912           15
Median    5×5    42             1188           29
Median    7×7    22             896            40

Acknowledgment

The author would like to thank Prof. Philippe Van Ham for his invaluable advice and support.

6. REFERENCES

[1] Photo-Sonics, Photo-Sonics Phantom V4.0 High Speed Color Digital Camera Data Sheet, 2004. [Online]. Available: http://www.photosonicsinternational.co.uk/

ets) and the maximum operating frequency (F Clk ).

[2] J. Hwang and J. Jong, “Systolic architecture for 2-D rank order filtering,” Proceedings of the International Conference on Application Specific Array Processors, pp. 90–99, 1990.

[3] J. Hwang and J. Jong, “A new VLSI architecture for rank order and stack filters,” Proc. of the IEEE International Symposium on Circuits and Systems, May 1992.

[4] C. Chakrabarti, “High Sample Rate Array Architectures,” IEEE Transactions on Signal Processing, pp. 707–712, May 1994.

[5] G. Ravi and E. Paraskevas, “Design and Implementation of an Efficient General-Purpose Median Filter Network,” Digital Signal Processing, vol. 3, no. 1, pp. 64–72, Jan. 1993.

[6] K. L. Chung, “A fast pipelined median filter network,” Digital Signal Processing, vol. 51, no. 2, pp. 133–136, Jan. 1996.

4. PERFORMANCE COMPARISONS

The implementation results from Table 2 can be used to determine the actual throughput of 32 computational modules using the following expression:

$$D = 32 \times \eta \times F_{Clk} \tag{4}$$

where η, the computational efficiency, is defined as the ratio of computed pixels to the pixels needed for the processing:

$$\eta = \frac{\tilde{W}_x}{W_x} = \frac{W_x - (N_x - 1)}{W_x} \tag{5}$$

For the considered neighborhood sizes and window widths, that is 20, 16 and 14 pixels, η has the following values: 0.9, 0.8 and 0.7 respectively. This factor has to be taken into account since some redundant memory accesses are necessary due to the window overlap (border effect). Based on the system clock frequency (actually 80% of it) and Eq. (4), the throughput of the reconfigurable system is calculated. Table 3 indicates the throughput of the GPA (D2), that of the reconfigurable architecture (D3) and the speedup.
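As a worked instance of Eqs. (4) and (5), the short Python sketch below recomputes the 3 × 3 entries of Table 3 from the figures of Table 2. The function names and the `clock_derating` parameter are hypothetical; the factor 0.8 reflects the statement that 80% of the maximum clock frequency is used.

```python
def efficiency(wx, nx):
    """Eq. (5): fraction of the Wx loaded pixels that yield an output pixel."""
    return (wx - (nx - 1)) / wx


def throughput(eta, f_clk_mhz, modules=32, clock_derating=0.8):
    """Eq. (4), in MPx/sec, with the clock taken at 80% of its maximum."""
    return modules * eta * clock_derating * f_clk_mhz


eta_3x3 = efficiency(wx=20, nx=3)                   # 0.9
d3_minmax_3x3 = throughput(eta_3x3, f_clk_mhz=115)  # about 2650 MPx/sec
d3_median_3x3 = throughput(eta_3x3, f_clk_mhz=83)   # about 1912 MPx/sec
speedup_minmax_3x3 = d3_minmax_3x3 / 200            # D2 = 200 MPx/sec
```

Dividing the result by the corresponding D2 value gives the speedup column, here about 13 for the 3 × 3 Min/Max filter.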

[7] J. Smith, “Implementing Median Filters in XC4000E FPGAs,” XCELL, Q4(23), 1996.

[8] N. Woolfries, P. Lysaght, S. Marshall, G. McGregor, and D. Robinson, “Fast Adaptive Image Processing in FPGAs Using Stack Filters,” Field-Programmable Logic: From FPGAs to Computing Paradigm, pp. 406–410, 1998.

[9] A. Hiasat, M. Al-Ibrahim, and K. Gharaibeh, “Design and implementation of a new efficient median filtering algorithm,” VISP, vol. 146, no. 5, pp. 273–285, Oct. 1999.

[10] T. Lee, J. H. Lee, and S. Cho, “FPGA implementation of a 3x3 window median filter based on a new efficient bit-serial sorting algorithm,” Proceedings of the 7th Korea-Russia International Symposium - KORUS 2003, pp. 237–242, 2003.

5. CONCLUSION

In this paper we address the problem of fast ranking filter computation on two different computer architectures. First, we showed that a carefully optimized and fine tuned application using the Pentium 4 SSE2 extension can reach a throughput of 20 to 200 MB/sec, depending on the neighborhood size and rank. While such performance can meet the requirements of real-time video processing, this will certainly not be the case if higher resolutions and/or acquisition speeds are needed. The second architecture that we considered is based on a high density FPGA device. The proposed system combines optimized memory access and bit-serial algorithms

[11] A. Gasteratos, I. Andreadis, and Ph. Tsalides, “Non-linear Image Processing in Hardware,” Pattern Recognition, vol. 33, no. 6, pp. 1013–1021, Sep. 2000.

[12] M. I. Fomitchev, “MMX technology code optimization,” Dr. Dobb’s Journal, 1999.

[13] P. E. Danielsson, “Getting the median faster,” Computer Graph. Image Proc., vol. 17, pp. 71–78, 1981.

[14] AlphaData, ADP-WRC-II Hardware Overview, 2004.

[15] AlphaData, ADP-WRC-II Product Specifications, 2004.