Memory bandwidth-constrained parallelism dimensioning for embedded many-core microprocessors

SERGIU CARPOV (a), RENAUD SIRDEY (a), JACQUES CARLIER (b) and DRITAN NACE (b)

(a) CEA, LIST, Embedded Real Time Systems Laboratory, Point Courrier 94, Gif-sur-Yvette, 91191 France.
(b) UMR CNRS 6599 Heudiasyc, Université de Technologie de Compiègne, Centre de recherches de Royallieu, BP 20529, 60205 Compiègne Cedex, France.

This paper deals with the problem of estimating a target degree of parallelism under a memory bandwidth constraint, such a target being used to guide a compilation chain functioning by parallelism reduction. After a formal definition of the problem, we prove its NP-hardness and propose a two-stage heuristic method for solving it. We then provide preliminary computational results in order to evaluate the solutions obtained by the two-stage heuristic.

Additional Key Words and Phrases: Combinatorial optimization, Many-core microprocessors, Scheduling

1. INTRODUCTION

In this paper, we investigate the problem of evaluating the memory bandwidth required for the sequential execution of a parallel algorithm, so as to estimate the number of tasks which may be executed in parallel (with respect to a hardware memory bandwidth constraint) on an embedded many-core microprocessor (see Fig. 1 for an illustration). Such an estimation can be used in a compilation chain for dataflow programs functioning by parallelism reduction, in order to fix an appropriate target for the degree of parallelism. More precisely, we describe a method to estimate an achievable degree of parallelism for an algorithm, that is, the ratio Λ/λ between the external memory bandwidth Λ and the bandwidth λ required by an optimal (in a sense defined later) sequential execution of the algorithm.
Fig. 1. An abstract many-core architecture.

We suppose that the algorithm is intrinsically parallel, so its structure in terms of parallelism is not an issue. A coarse estimation of λ can be obtained by dividing the total amount of data needed to compute all the tasks of the algorithm by a lower bound on the duration of a sequential execution. However, such an overly conservative estimation is of little practical interest: firstly, because it does not take into account the fact that memory accesses can be efficiently managed, and secondly, because lower bounds on the execution time are barely achievable in practice. Both aspects must be considered for a more realistic estimation of λ.

Let an algorithm be a set of independent tasks (no precedence relations between them) that use data from an external memory. Tasks may share common data. The sequential execution time of the algorithm depends on the order in which the tasks are executed and on the way the data is managed (via data reuse). The goal of this study is to find a task execution order which, combined with an appropriate data management, gives a good estimation of the maximum bandwidth required by a sequential execution.

This paper is organized as follows. The second section surveys the existing related work. In the third section we give a formal definition of our problem and provide complexity results. The fourth section is devoted to some special cases. Finally, the fifth section describes a two-stage heuristic together with some experimental results.

2. RELATED WORK

Previous work related to our problem is quite scarce; the similarities between our problem and existing works mainly lie in the tools used to solve related, though different, problems. It appears that [Ding and Kennedy 2004] were the first to study a problem relatively close to ours. In their work they study how to reorder program instructions so as to improve program data access locality, thus augmenting the data reuse obtained by the data caching policy.

Some earlier works [Wolf and Lam 1991; McKinley et al. 1996] describe methodologies for optimizing data reuse in program loops. In a series of two papers, [Ding and Kennedy 1999; Ding and Orlovich 2004] describe two program transformations: locality grouping, which reorders program data accesses in order to improve temporal data reuse, and dynamic data packing, which consists in reorganizing the data layout so as to improve spatial data reuse.


Other papers [Pingali et al. 2003; Strout et al. 2003] describe similar approaches to data locality improvement and provide benchmark analyses.

3. PROBLEM STATEMENT AND COMPLEXITY RESULTS

Let (S, E, δ) be a triplet denoting an algorithm, where S = {1 . . . n} is the set of algorithm tasks, E = {1 . . . m} is the set of algorithm inputs loaded from the external memory, and δ : S → 2^E is a function such that δ(s) denotes the set of inputs needed for the calculation of task s. We make the hypothesis that there are no precedence relations between algorithm tasks and that several tasks can use the same external memory input.

Let (π, γ) be a pair denoting a task execution order and a memory management, where π : S → {1 . . . n} is an order of task execution and γ : S → 2^E is a function that assigns a set of inputs γ(s) to each task s ∈ S. The set γ(s) gives the on-chip memory state at the beginning of the calculation of task s; evidently, for each task s ∈ S the relation δ(s) ⊆ γ(s) is verified. The remaining inputs, γ(s) \ δ(s), come from data reuse, the process of reusing inputs already present in the on-chip memory, originating from previously calculated tasks.

For a task execution order and memory management (π, γ), the number of external memory accesses is:

    f(π, γ) = |γ(s_π(1))| + Σ_{i=2..n} |γ(s_π(i)) \ γ(s_π(i−1))|        (1)

Throughout this paper we suppose that the available on-chip memory size is equal to C; thus the condition |δ(s)| ≤ |γ(s)| ≤ C must be verified for each task s ∈ S.

Problem 3.1 (Task ordering and memory management problem). Let (S, E, δ) be an algorithm and C be the available on-chip memory size. Find a task execution order and memory management (π, γ) such that the number of external memory accesses f(π, γ) is minimized.
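As a concrete illustration of equation (1), a few lines of Python suffice to evaluate f(π, γ) once the memory states are known. This is only a sketch; the list-of-sets encoding and the helper name are ours, not part of the paper.

    def external_accesses(gamma_seq):
        """Evaluate f(pi, gamma) from equation (1).

        gamma_seq: list of sets; gamma_seq[i] is the on-chip memory state
        gamma(s_pi(i+1)) at the start of the (i+1)-th executed task.
        """
        total = len(gamma_seq[0])            # the first state is loaded entirely
        for prev, cur in zip(gamma_seq, gamma_seq[1:]):
            total += len(cur - prev)         # only inputs absent from the previous state are fetched
        return total

For instance, external_accesses([{1, 2, 3}, {3, 4, 5}]) returns 5: input 3 is reused, so only two new inputs are fetched for the second task.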

Proposition 3.2. The task ordering and memory management problem is NP-hard.

Proof. The problem of the existence of a Hamiltonian path in an arbitrary graph is NP-complete [Garey and Johnson 1979, p. 199]. Using the following transformation, we show that the Hamiltonian path existence problem reduces to our problem. Suppose that the tasks of the algorithm (S, E, δ) use exactly C (the available on-chip memory size) inputs from the external memory, |δ(s)| = C for any s ∈ S, and that there is at most one common input between any two tasks, |δ(s) ∩ δ(s′)| ≤ 1 for any s, s′ ∈ S.

Let G = (S, A) be an undirected graph whose vertices are the algorithm tasks. The graph contains an edge between vertices i and j if the corresponding tasks use a common input: A = {(s, s′) | s, s′ ∈ S, |δ(s) ∩ δ(s′)| = 1}. An example of such a graph G is illustrated in Fig. 2. This instance of our problem has a solution of cost n · C − n + 1 if and only if there exists a Hamiltonian path in graph G: along such a path the first task loads C inputs and each of the remaining n − 1 tasks reuses exactly one input from its predecessor, for a total of C + (n − 1)(C − 1) = n · C − n + 1 accesses. As the question of the existence of a Hamiltonian path in graph G is NP-complete, our problem is NP-hard.


Fig. 2. Example of a graph G associated with an instance of our problem. Here the algorithm has 6 tasks, each using exactly C = 4 inputs. Tasks s1 and s2 use a common input e1, so the graph has an edge between s1 and s2, etc.

4. TASK ORDERING AND MEMORY MANAGEMENT PROBLEM: SPECIAL CASES

In order to avoid trivial cases, we suppose that the number of inputs is larger than the on-chip memory size, |E| > C.

4.1 Fixed task execution order

Suppose that the task execution order π is given and we need to find a memory management, γ(s) for every s ∈ S, so as to minimize the objective function (1). For this special case a polynomial time algorithm is proposed, see Algorithm 1. It finds the memory states γ(s) for each task s ∈ S, together with the minimal number of external memory accesses.

This algorithm is based on the principle used in the optimal cache replacement algorithm proposed by [Belady 1966].¹ In our context, it may be informally stated as follows: when a memory location is needed for a given input and the on-chip memory is full, free space should be obtained by dropping the input which is to be used in the farthest future. The proposed algorithm is polynomial and has a complexity of O(n² · m). It can be further optimized to an O(n · log n · m) complexity, but in order to facilitate comprehension the simplest version is provided.

¹ We do not directly use the algorithm described in Belady's paper because, in its model, a single memory location is loaded at each step, whereas in our model more than one input can be loaded at each step.

4.2 One-step history limited data reuse

Let us consider the task ordering and memory management problem for an algorithm (S, E, δ). We suppose that the tasks are calculated one by one and that only the inputs loaded for the previously calculated task can be reused; thus the data reuse is limited to a one-step history. In this case the objective function (to minimize) becomes:

    f′(π, γ) = |δ(s_π(1))| + Σ_{i=2..n} |δ(s_π(i)) \ δ(s_π(i−1))|

For a given algorithm (S, E, δ) we build a digraph G = (V, A, c), defined as follows. The set of vertices V = S ∪ {s_d} contains the algorithm tasks plus a dummy task s_d, which does not use any input: δ(s_d) = ∅.


Algorithm 1. Optimal on-chip memory management algorithm for the task permutation 1, 2, . . . , n.

    γ(1) ← δ(1)
    γ_prev ← δ(1)
    N ← |δ(1)|                       {N is the total number of external memory accesses}
    for i = 2 to n do
        γ ← δ(i)
        N ← N + |δ(i) \ γ_prev|      {task i requires only |δ(i) \ γ_prev| new accesses}
        j ← i + 1
        while j ≤ n and |γ| < C do
            find a ⊆ (δ(j) \ γ) ∩ γ_prev such that |γ| + |a| ≤ C
            γ ← γ ∪ a
            j ← j + 1
        end while
        γ(i) ← γ
        γ_prev ← γ
    end for
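A direct Python transcription of Algorithm 1 may help fix the ideas. This is a sketch under our naming: delta is the list of input sets in the given execution order, and |delta[i]| ≤ C is assumed for every task.

    def memory_states(delta, C):
        """Sketch of Algorithm 1: optimal memory management for a fixed order.

        delta: list of input sets, delta[i] being the inputs of the
        (i+1)-th executed task, with |delta[i]| <= C for all i.
        C: on-chip memory capacity, counted in inputs.
        Returns the memory states gamma and the access count N.
        """
        n = len(delta)
        gamma = [set(delta[0])]
        gamma_prev = set(delta[0])
        accesses = len(delta[0])                 # N <- |delta(1)|
        for i in range(1, n):
            g = set(delta[i])
            accesses += len(g - gamma_prev)      # only the missing inputs are loaded
            j = i + 1
            while j < n and len(g) < C:
                # retain on-chip inputs that the nearest future tasks will need
                for x in (set(delta[j]) - g) & gamma_prev:
                    if len(g) >= C:
                        break
                    g.add(x)
                j += 1
            gamma.append(g)
            gamma_prev = g
        return gamma, accesses

The eviction rule is implicit: whatever is not carried over into the new state is dropped, and inputs of the nearest future tasks are carried over first, which is exactly the farthest-future eviction principle stated above. By construction the returned states satisfy δ(s) ⊆ γ(s) and |γ(s)| ≤ C.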

The set of edges A defines a complete graph. The cost function c : A → N associates to each edge (s, s′) the number of external memory accesses needed to execute task s′ after task s, thus c(s, s′) = |δ(s′) \ δ(s)|.

It can then easily be shown that the solution of the asymmetric traveling salesman problem (TSP) on graph G gives the optimal solution to the task ordering and memory management problem with one-step limited data reuse. Let H be the lowest-cost Hamiltonian circuit in graph G; then the relation Σ_{(i,j)∈H} c(i, j) ≤ Σ_{(i,j)∈H′} c(i, j) is true for any other circuit H′. If the cost function c is replaced with its definition, we obtain the objective function of the one-step limited data reuse problem, taking into account that δ(s_d) = ∅. Although the asymmetric TSP is an NP-complete problem, this reduction allows our special case to be solved using well-studied exact or approximate algorithms, see [Carpaneto et al. 1995].

5. A TWO-STAGE HEURISTIC

5.1 A simple two-stage heuristic

In this section we describe a simple heuristic for solving the task ordering and memory management problem, see Algorithm 2. Our motivation for introducing a heuristic method is, firstly, the NP-hardness of the problem and, secondly, the necessity, in a compilation chain, of finding solutions to the problem in a reasonable time. The idea behind this heuristic is to divide the problem into two sub-problems and to solve each of them independently, employing the two special cases described in the previous section. The first sub-problem consists in finding a task ordering that minimizes the total number of external memory accesses when only a one-step history is permitted; it is solved using any approximate method for the asymmetric TSP. The second sub-problem is to find a memory management for the given task ordering; it is solved in polynomial time using Algorithm 1.


Algorithm 2. The two-stage heuristic.

    Input: an algorithm (S, E, δ).
    1. Build the complete digraph G = (S ∪ {s_d}, A, c) as defined in Subsection 4.2.
    2. Find a Hamiltonian circuit σ in graph G using an approximate TSP method.
    3. Remove the dummy task s_d from the sequence σ: σ′ = σ \ {s_d}.
    4. Calculate the number of external memory accesses by applying Algorithm 1 to the task sequence σ′.
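Putting the pieces together, a minimal end-to-end sketch of Algorithm 2 could look as follows. It reuses the hypothetical helpers atsp_costs and memory_states sketched earlier, and it uses a crude nearest-neighbour tour construction as the approximate TSP method of step 2; any stronger ATSP heuristic from [Carpaneto et al. 1995] can be substituted.

    def two_stage_heuristic(delta, C):
        """Approximate the minimal number of external memory accesses."""
        c = atsp_costs(delta)                    # step 1: build the digraph G
        n = len(delta)                           # index n is the dummy task s_d
        # step 2: nearest-neighbour circuit starting from s_d
        order, current, remaining = [], n, set(range(n))
        while remaining:
            current = min(remaining, key=lambda j: c[current][j])
            order.append(current)
            remaining.remove(current)
        # step 3: s_d never enters `order`, so the dummy is already removed
        # step 4: optimal memory management for the obtained sequence
        _, accesses = memory_states([delta[i] for i in order], C)
        return order, accesses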

5.2 Exact Branch&Bound method

In order to evaluate the performance of the two-stage heuristic, an exact problem solver is needed. We developed a Branch&Bound method for finding optimum solutions, see [Carpov 2008]. The Branch&Bound algorithm starts with an empty task ordering. At each branching decision it appends to this ordering a new task not yet ordered; a leaf is obtained when all tasks are ordered. A lower bound as well as a dominance relation are used to reduce the search space. Without going into detail, in the lower bound calculation we make the hypothesis that the inputs used by the not yet ordered tasks will be accessed only once; a sketch is given below. Although not as tight as we may have hoped, this bound is computationally cheap and still allows the search tree to be pruned significantly. Indeed, even for a special case of our problem, obtaining a cheap, tight lower bound appears to be quite difficult, see [Ruiz et al. 2008]. Additionally, the proposed dominance relation divides the tasks into independent sub-sets (tasks that do not have any common inputs) and applies the Branch&Bound method to each sub-set separately.
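The bound at a search node might be sketched as follows. The names are ours, and so is the reading of the hypothesis: every input needed by the not-yet-ordered tasks and not currently on chip is counted exactly once, while inputs already on chip may, at best, be kept for free — both assumptions only make the bound lower, so it remains valid.

    def lower_bound(accesses_so_far, on_chip, remaining_delta):
        """Optimistic completion cost for a Branch&Bound node (a sketch)."""
        needed = set().union(*remaining_delta) if remaining_delta else set()
        return accesses_so_far + len(needed - on_chip)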

5.3 Two-stage heuristic evaluation

The hypothesis of intrinsically parallel algorithms constrains us to limit the computational experiments to easily parallelizable algorithms. Image processing algorithms are a good example: a highly parallel execution is possible for many of them. Image processing algorithms work with huge amounts of data, one of the smallest image resolutions being 640 × 480 pixels. Because of this fact, and of course because of the NP-hardness of the task ordering and memory management problem, the exact resolution of practical instances is out of reach of even the most sophisticated methods. In order to be able to compare the two-stage heuristic with the exact Branch&Bound method, we limit our computational results to small instances of image processing algorithms.

Thereafter, we do so for the classical image processing primitive of image convolution (see [Gonzalez and Woods 2001] for more details). The image convolution algorithm calculates the convolution product of an image I with a kernel K: the value of each output pixel (p, q) is Σ_i Σ_j I[p − i, q − j] · K[i, j], a function of the neighborhood of the pixel in the input image. For our experiments we use a 3 × 3 square neighborhood. We suppose that a task calculates the convolution product for one output pixel, so each task uses 9 inputs. As image convolution instance we take a 7 × 7 input image; in this case the number of tasks will be 25 (output image pixels belonging to image boundaries are not calculated).
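This instance is easy to generate programmatically. In the following sketch (our naming), each interior pixel yields one task whose δ-set is its 3 × 3 neighbourhood, giving 25 tasks of 9 inputs each over |E| = 49 inputs.

    def convolution_instance(width=7, height=7):
        """delta sets of the 3x3 image convolution benchmark instance."""
        return [{(p + i) * width + (q + j)          # flat pixel index
                 for i in (-1, 0, 1) for j in (-1, 0, 1)}
                for p in range(1, height - 1)       # boundary pixels are
                for q in range(1, width - 1)]       # not calculated

Feeding this instance to the two-stage heuristic sketched above for C = 9, . . . , 21 should yield numbers comparable to the Heuristic row of Table I; the exact values depend on the ATSP method used in the first stage.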


Fig. 3. Image convolution optimal calculation order, C = 9. (The 25 output pixels are computed in a boustrophedon order: 0 1 2 3 4 / 9 8 7 6 5 / 10 11 12 13 14 / 19 18 17 16 15 / 20 21 22 23 24.)

    Buffer size, C      9    11    13    15    17    19    21
    Optimal solution   81    65    57    55    49    49    49
    Heuristic          81    71    63    57    53    51    49

Table I. Comparison between the exact solution (Branch&Bound) and the approximate solution (two-stage heuristic).

When the on-chip memory size has the minimum possible value, C = 9, the optimal calculation order of the output image pixels is the one illustrated in Fig. 3. Table I presents the optimal and the approximate numbers of external memory accesses for the above 7 × 7 image convolution example, with on-chip buffer sizes ranging from 9 to 21. We note that for the minimum possible buffer size, C = 9, the solutions coincide. This is explained by the fact that for the image convolution with a minimal buffer size, the data reuse is limited to a one-step history. As the buffer size grows, the two-stage heuristic gives worse solutions than the exact method, because the optimal task ordering no longer coincides with the one-step history limited task ordering. Finally, the minimum possible number of memory accesses, 49 = |E|, is obtained by the exact method for C = 17 and by the heuristic for C = 21.

Some other image processing algorithm instances (e.g. the Hough transform) have been used to test the proposed heuristic, giving the same kind of insights. We note that when the on-chip buffer has the minimum possible size, only a small amount of data can be reused, so the two-stage heuristic gives near-optimal solutions.

6. CONCLUSIONS

In this paper, we have introduced and examined the task ordering and memory management problem. The main goal is to find a task execution order and an external memory data loading strategy so as to minimize the total number of external memory accesses of an algorithm. The data loading is constrained by the available on-chip memory size, which is why an optimal data management strategy is needed. The NP-hardness of the problem has been proved and two special cases have been described. Using these special cases, a two-stage approximate heuristic has been proposed. In order to evaluate the heuristic's performance, an exact method has been developed and several computational experiments have been realized. The algorithms considered in these experiments are from the image processing field, one of the main application domains for embedded many-cores. The obtained results are encouraging, especially when the on-chip memory has the minimum possible size, the approximate solutions then being close to the optimal ones.

A subsequent paper will focus on a generalization of the task ordering and memory management problem in which the task execution time is taken into account. In this timed problem, data prefetching becomes a possible optimization criterion alongside data reuse. Some preliminary work revealed a special case of the timed problem, in which there is no common data between the tasks, that can be solved polynomially using Johnson's algorithm.

REFERENCES

Belady, L. A. 1966. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal 5, 2, 78–101.
Carpaneto, G., Dell'Amico, M., and Toth, P. 1995. Exact solution of large-scale, asymmetric traveling salesman problems. ACM Trans. Math. Softw. 21, 4, 394–409.
Carpov, S. 2008. Optimisation du préfetch et du parallélisme pour plateforme MPSoC. M.S. thesis, Université de Technologie de Compiègne.
Ding, C. and Kennedy, K. 1999. Improving cache performance in dynamic applications through data and computation reorganization at run time. SIGPLAN Not. 34, 5, 229–241.
Ding, C. and Kennedy, K. 2004. Improving effective bandwidth through compiler enhancement of global cache reuse. J. Parallel Distrib. Comput. 64, 1, 108–134.
Ding, C. and Orlovich, M. 2004. The potential of computation regrouping for improving locality. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, Washington, DC, USA, 13.
Garey, M. R. and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA.
Gonzalez, R. C. and Woods, R. E. 2001. Digital Image Processing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
McKinley, K. S., Carr, S., and Tseng, C.-W. 1996. Improving data locality with loop transformations. ACM Trans. Program. Lang. Syst. 18, 4, 424–453.
Pingali, V. K., McKee, S. A., Hsieh, W. C., and Carter, J. B. 2003. Restructuring computations for temporal data cache locality. Int. J. Parallel Program. 31, 4, 305–338.
Ruiz, R., Şerifoğlu, F. S., and Urlings, T. 2008. Modeling realistic hybrid flexible flowshop scheduling problems. Comput. Oper. Res. 35, 4, 1151–1175.
Strout, M. M., Carter, L., and Ferrante, J. 2003. Compile-time composition of run-time data and iteration reorderings. In PLDI '03: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation. ACM, New York, NY, USA, 91–102.
Wolf, M. E. and Lam, M. S. 1991. A data locality optimizing algorithm. In PLDI '91: Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation. ACM, New York, NY, USA, 30–44.