
AN FPGA SOLVER FOR WSAT ALGORITHMS

Kenji Kanazawa and Tsutomu Maruyama
Systems and Information Engineering, University of Tsukuba
1-1-1 Ten-ou-dai, Tsukuba, Ibaraki 305-8573, JAPAN

ABSTRACT

WSAT and its variants are among the best-performing stochastic local search algorithms for the satisfiability (SAT) problem. In this paper, we propose a new FPGA solver for WSAT algorithms. The features of our solver are (1) high parallelism through small units that evaluate the clauses of each SAT instance, (2) multi-thread execution to achieve high performance, and (3) fast data downloading for each instance. We implemented the solver for problems of up to 256 variables and 1024 clauses on an XC2V6000, where it used 45% of the slices and five block RAMs. Our implementation shows higher performance than previous SAT solvers on FPGAs.

1. INTRODUCTION

Stochastic local search algorithms show very good performance for the satisfiability (SAT) problem. WSAT and its variants [1, 2] are among the best-performing stochastic local search algorithms. These algorithms are well suited to hardware implementation because they do not require complex control structures and have inherent parallelism. A custom design specific to a particular SAT instance achieves the highest performance, but this approach is not realistic because synthesizing the instance-specific design takes longer than solving the instance on a microprocessor. Therefore, we need a solver that can solve any instance without re-synthesis.

In such an implementation, it is very important to minimize the size of the units that evaluate the clauses of each instance. With smaller units, we can evaluate more clauses in parallel and solve larger instances more efficiently. It is also important to minimize the data-downloading time for each instance.

In this paper, we propose a new implementation of an FPGA solver for 3-SAT problems. The features of our implementation are (1) high parallelism through small units that evaluate clauses, (2) multi-thread execution to achieve high performance, and (3) fast data downloading for each instance. Our FPGA solver can solve larger problems than previous works with fewer hardware resources, and shows higher performance than they do.

2. THE SATISFIABILITY PROBLEM

The satisfiability (SAT) problem is a very well-known combinatorial problem. An instance of the problem is defined by a given boolean formula $F$, and the question is to find an assignment of binary values to the variables that makes the formula true. Typically, $F$ is presented in conjunctive normal form (CNF), which is a conjunction of a number of clauses, where a clause is a disjunction of a number of literals. Each literal represents either a boolean variable or its negation. For example, a formula in CNF may consist of clauses with two literals and clauses with three literals; an assignment that satisfies all of the clauses satisfies the formula.

0-7803-9362-7/05/$20.00 ©2005 IEEE

If there are $k$ literals in each clause, the problem is called a $k$-SAT problem. SAT was the first problem shown to be NP-complete, and the $k$-SAT problem remains NP-complete for $k \ge 3$. A $k$-SAT problem can be transformed into a 3-SAT problem by introducing new variables.

Many algorithms and hardware solvers have been proposed to date. Algorithms for solving the SAT problem can be divided into two major groups: complete and incomplete. Complete algorithms can always find a solution (or conclude that the problem is unsatisfiable). Incomplete algorithms do not guarantee that a solution will be found. When these algorithms do not find a solution, it is impossible to determine whether the problem is unsatisfiable or the algorithm simply failed to find the solution. Nevertheless, these algorithms are of particular interest, because they are very effective on many large problems, and can be used to solve the maximum satisfiability problem.

WSAT [1] is one of the best-performing stochastic local search algorithms. Figure 1 shows the outline of the procedure of WSAT algorithms. The procedure begins by considering a random truth assignment. It searches for a solution by repeatedly selecting an unsatisfied clause at random, and then employing some heuristic to select a variable in that clause to flip (change from true to false or vice versa). In the procedure in Figure 1, the parameters MAX-TRIES (the number of new search sequences) and MAX-FLIPS (the number of variable flips per try) are used to control the maximum runtime of the algorithm.

procedure WSAT
  input a CNF formula F, MAX-FLIPS and MAX-TRIES
begin
  for i in 1 to MAX-TRIES
    T = randomly generated truth assignment
    for j in 1 to MAX-FLIPS
      if T satisfies F then return T
      c = a random unsatisfied clause
      v = a variable in c chosen by a heuristic H
      T = T with v flipped
    end for
  end for
  return "no satisfying assignment found"
end

Fig. 1. The Procedure of WSAT Algorithms

Six heuristics for selecting a variable in a clause were considered in [2]. Here, we introduce two of them.

WSAT/G: With probability $p$, pick any variable; otherwise pick a variable that minimizes the total number of unsatisfied clauses. The value $p$ is the noise parameter, which ranges from 0 to 1.

WSAT/SKC: For each variable in the clause, count the number of clauses that are true in the current assignment but would become false if the flip were made (this value is called the break-value). If variables with a break-value of 0 exist, pick any of them. If not, with probability $p$, pick any variable; otherwise pick a variable that gives the minimum break-value.
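The procedure of Figure 1 and the two heuristics can be sketched in software as follows. This is a minimal illustration, not the circuit described in this paper: the signed-integer clause encoding and the names `wsat`, `pick_skc`, `pick_g` and `break_value` are assumptions made for the example, and every clause is re-evaluated from scratch for clarity rather than speed.

```python
import random

def satisfied(clause, assign):
    # A literal is a signed integer: +v means variable v, -v means NOT v.
    return any(assign[abs(lit)] == (lit > 0) for lit in clause)

def break_value(v, formula, assign):
    # Clauses that are true now but would become false if v were flipped.
    before = [satisfied(c, assign) for c in formula]
    assign[v] = not assign[v]
    after = [satisfied(c, assign) for c in formula]
    assign[v] = not assign[v]  # undo the temporary flip
    return sum(b and not a for b, a in zip(before, after))

def unsat_after_flip(v, formula, assign):
    # Total number of unsatisfied clauses if v were flipped (WSAT/G score).
    assign[v] = not assign[v]
    n = sum(not satisfied(c, assign) for c in formula)
    assign[v] = not assign[v]
    return n

def pick_skc(clause, formula, assign, p, rng):
    variables = [abs(lit) for lit in clause]
    scores = {v: break_value(v, formula, assign) for v in variables}
    free = [v for v in variables if scores[v] == 0]
    if free:
        return rng.choice(free)
    if rng.random() < p:
        return rng.choice(variables)
    return min(variables, key=scores.get)

def pick_g(clause, formula, assign, p, rng):
    variables = [abs(lit) for lit in clause]
    if rng.random() < p:
        return rng.choice(variables)
    return min(variables, key=lambda v: unsat_after_flip(v, formula, assign))

def wsat(formula, n_vars, max_tries, max_flips, p=0.5, pick=pick_skc, seed=0):
    rng = random.Random(seed)
    for _ in range(max_tries):
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            unsat = [c for c in formula if not satisfied(c, assign)]
            if not unsat:
                return assign
            clause = rng.choice(unsat)
            v = pick(clause, formula, assign, p, rng)
            assign[v] = not assign[v]
    return None  # no satisfying assignment found
```

Passing `pick=pick_g` switches the heuristic from WSAT/SKC to WSAT/G without changing the outer loop, which mirrors how both heuristics share the procedure of Figure 1.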

3. PREVIOUS WORKS

Many hardware solvers have been proposed to date; they are surveyed in detail in [3]. Here, we focus on FPGA solvers for incomplete algorithms. Incomplete algorithms are well suited to a reconfigurable hardware implementation because they do not require complex control structures and have inherent parallelism. However, fast incomplete algorithms are primarily intended for processing large CNF formulae, for which complete algorithms may not be applicable. Therefore, solving large problems with the limited hardware resources of FPGAs is a challenging problem. In the following discussion, the number of variables in a formula is denoted $N_v$, and the number of clauses $N_c$.

In [5], the pure random WSAT algorithm ($p = 1.0$ in WSAT/G, namely a variable is always selected at random) was implemented on an XCV300, and it could solve problems of 50 variables and 170 clauses. The performance of the circuit was 363.7K flips per second (fps). In this implementation, the values of the 50 variables are broadcast to 170 clause checkers, which consist of ROMs. To change the values in the ROMs for another instance, a new bit-stream configuration is generated by partially modifying the bit-stream. With this implementation, large problems cannot be solved because of the number of LUTs required for the clause checkers.

In [6], pure random WSAT and WSAT/G were implemented on an XCV1000, and the circuit could solve problems of 100 variables and 220 clauses. The circuit ran at 20MHz, and its performance was 20M fps in pure random WSAT (pipelined) and 2.2M fps in WSAT/G (not pipelined). In this implementation, the bit-stream configuration is not modified; instead, clause data for each instance are generated on the host computer and downloaded to the FPGA in order to maintain high portability. It takes 7.6ms to generate and download the data to the on-board SRAMs, and the FPGA takes additional clock cycles to read the data. The total size of the clause evaluators (the same as the clause checkers in [5]) again scales in the same way as in [5], and it was estimated that problems with 100 variables and 600 clauses could be solved with one XC2V6000.

4. OUR APPROACH

In our approach, clause data for each instance are generated on the host computer and downloaded to the FPGA to maintain high portability of the circuit. The features of our approach are as follows.

1. Clause evaluators are small. The total size of the evaluators is almost proportional to $N_c$ only. The size of one evaluator is 22 LUTs (up to 256 variables) or 25 LUTs (up to 4096 variables), which is much smaller than in the previous works (with one XC2V6000, we can solve problems of 256 variables and 2048 clauses, or 2048 variables and 1792 clauses).

2. The circuit is completely pipelined, and achieves high performance by multi-thread execution.

3. Data downloading is fast (1024 (for clause definition) + 384 (for values of literals in the clauses) clock cycles for problems of 256 variables and 1024 clauses). This time is almost proportional to $N_c$ only.

4. Two kinds of heuristics (WSAT/G and WSAT/SKC) are supported without reconfiguration. WSAT/SKC outperforms WSAT/G, but it requires the status of each clause both before and after flipping a variable. Therefore, WSAT/SKC could not be implemented in previous works. We did not support pure random WSAT (it could be built into our circuit very easily), because its search ability is weak and it cannot solve large problems.

4.1. Details of the Pipeline Processing in Our Approach

Figure 2 shows a block diagram of our approach. The clause table holds the variable numbers in each clause. We do not need to know whether each literal in a clause is a negation or not, because the variable numbers are used only to flip literal values in the clause evaluators. The three variable numbers in the current clause are read out sequentially and broadcast to the clause evaluators. If the broadcast variable number is the same as one of the three variable numbers in a clause evaluator, the corresponding value in that clause evaluator is flipped, and the number of unsatisfied clauses (in WSAT/G) or the break-value (in WSAT/SKC) is counted by the score adder (its output is called the score). One of the unsatisfied clauses is selected by the clause selector at the same time. Thus, three sets of the broadcast variable number, its score and the unsatisfied clause number are stored in the variable selector. Then, one of the sets is selected by the variable selector by comparing the three scores, and the other two sets are discarded. The value of the selected variable in the variable table is flipped (this table is used only to output a solution), and the clause in the selected set becomes the next current clause.

Fig. 2. A Block Diagram of Our Circuit
[Block diagram: Clause Table (V0 V1 V2 per clause), Current Clause / Current Variable, Clause Evaluators C0 ... C(Nc-1), Score Adder, Clause Selector, Variable Selector (flip), two Random Number Generators, Variable Table.]
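The broadcast-and-score step described above can be modelled in a few lines of software. The sketch below is an illustration under assumptions, not the circuit itself: the class `ClauseEvaluator` and the function `trial_scores` are hypothetical names, clauses are encoded as signed integers only to compute the initial literal values, and after initialization each evaluator keeps just variable numbers and literal values, as the text describes.

```python
class ClauseEvaluator:
    """One clause: its variable numbers and the current value of each literal."""

    def __init__(self, clause, assign):
        # The sign of each literal is used once, here, to derive literal values.
        self.var_numbers = [abs(lit) for lit in clause]
        self.values = [assign[abs(lit)] == (lit > 0) for lit in clause]

    def status(self):
        # A clause is satisfied when any literal value is true (OR).
        return any(self.values)

    def status_if_flipped(self, v):
        # Temporary flip: XOR each literal value with "variable number matches v".
        return any(val != (var == v)
                   for var, val in zip(self.var_numbers, self.values))

    def commit(self, v):
        # Make the flip of variable v permanent in this evaluator.
        self.values = [val != (var == v)
                       for var, val in zip(self.var_numbers, self.values)]


def trial_scores(evaluators, current_vars, skc=True):
    """Score each variable of the current clause by temporarily flipping it."""
    scores = []
    for v in current_vars:
        if skc:
            # WSAT/SKC break-value: satisfied now, unsatisfied after the flip.
            score = sum(e.status() and not e.status_if_flipped(v)
                        for e in evaluators)
        else:
            # WSAT/G: number of unsatisfied clauses after the flip.
            score = sum(not e.status_if_flipped(v) for e in evaluators)
        scores.append(score)
    return scores
```

In hardware all evaluators respond to a broadcast in parallel; the loops here only stand in for that parallelism.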

Our circuit consists of 10 pipeline stages. As described above, before selecting the variable to be flipped in the current clause, we need to evaluate the clauses and count the scores by temporarily flipping each of the three variables in the current clause. Therefore, we need 12 clock cycles for one flip, and three stages of the pipelined circuit are occupied for one flip. This means that the circuit is idle for nine clock cycles. In these nine clock cycles, we could flip three more variables, but these three flips cannot be the successors of the first flip, because the calculation for the first flip is still in the pipeline. In order to utilize these nine clock cycles, four independent tries (as shown in Figure 1, the search is repeated MAX-TRIES times with different initial assignments) are executed at the same time on the pipeline (multi-thread execution). Figure 3 shows the flow of the multi-thread execution.

The effectiveness of the multi-thread execution can be examined by simulation. Table 1 shows the number of flips needed to find a solution by WSAT/SKC with different numbers of threads (the data in Table 1 are the average of 50 runs). We used the uf225-960 benchmark problems [7] (problems with 225 variables and 960 clauses). Among the one hundred instances, uf225-087 is the easiest, and uf225-039 is the hardest. As shown in Table 1, over all instances the performance of the multi-thread execution is almost proportional to the number of threads (it does not work well for easy instances, but easy instances are not the target of hardware solvers in general).

Figure 4 shows the details of a clause evaluator. A clause evaluator consists of three pairs of two variable number arrays and a literal value table. The values in the variable number arrays and the literal value table are initialized for each instance by downloading data from the host computer. A variable number broadcast from the clause table is divided into 4-bit fields, which are used as addresses for the variable number arrays.
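The multi-thread schedule described above (12 clock cycles per flip, of which 3 occupy the issue slot and 9 are idle for that thread) can be checked with a small cycle-level model. This is a hedged sketch: `simulate_issue` and its parameters are hypothetical, and the model only counts how many flips can be started, not whether they lead to a solution.

```python
def simulate_issue(n_threads, n_cycles, latency=12, issue=3):
    """Count flips started in n_cycles on one shared pipeline.

    Each flip keeps the issue slot busy for `issue` cycles (three trial
    flips are broadcast), and its thread cannot start its next flip until
    `latency` cycles after the flip started.
    """
    ready = [0] * n_threads  # first cycle at which each thread may start a flip
    flips = 0
    cycle = 0
    while cycle < n_cycles:
        # Pick the first thread whose previous flip has left the pipeline.
        thread = next((t for t in range(n_threads) if ready[t] <= cycle), None)
        if thread is None:
            cycle += 1  # pipeline input is idle for this cycle
            continue
        ready[thread] = cycle + latency
        flips += 1
        cycle += issue
    return flips


if __name__ == "__main__":
    for threads in (1, 2, 4):
        print(threads, simulate_issue(threads, 1200))
```

With one thread the issue slot is used 3 cycles out of every 12; with four threads it is never idle, which is why four independent tries fill the pipeline in this model.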
In Figure 4, 3 (three literals in a clause) × 2 (for each literal) arrays are used to test whether the broadcast variable number is in the clause evaluator. By adding one more array to each pair (three in total), we can test up to 4096 variables (12 bits). A literal value table consists of a dual-port distributed RAM, and holds the values of a literal. The value in a literal value table is flipped if the outputs of the corresponding variable number arrays are all true (that is, its variable number is the same as the broadcast variable number). Four bits in the literal value table are used for one thread in the multi-thread execution, so 16 bits are used by four threads. One of the four bits holds the current value of the literal, and the other three bits store the new values of the literal generated by the temporary flipping of the literal value. Then, one of the three new values is used as the new current value of the literal.

Figure 5 shows how the position of the current value (the current position) in the literal value tables is managed. The current position is held in a two-bit register, and is fed to the address generator after being combined with a counter which generates 0, 1 and 2. The current position combined with the thread number gives the address from which the current value is read, and the output of the address generator combined with the thread number gives the address at which a new value is stored.

The current values in the three literal value tables and the new values generated by temporary flipping are each ORed (giving the current status and the new status of the clause). The new status is sent to the clause selector, and the current and new status are sent to the score adder through a LUT that calculates the break-value in WSAT/SKC (in WSAT/G this LUT works as a selector that passes the new status). The score adder is simply a binary adder tree, which consists of 512 2-bit adders, 256 3-bit adders, 128 4-bit adders and so on, when $N_c$ is 1024.

Fig. 3. Multi-thread Execution
[Timing chart: the First, Second, Third and Fourth Threads occupy pipeline stages 0 to 9 in an interleaved fashion, each thread starting a new flip every 12 clock cycles.]

Table 1. Performance of the Multi-Thread Execution

Number of    Average (total)      uf225-087.cnf       uf225-039.cnf
Threads      #flips   speedup     #flips  speedup     #flips    speedup
 1           78882    -           1936    -           2690011   -
 2           40891    1.93        1020    1.90        1593729   1.69
 4           20162    3.91         814    2.37         600537   4.48
 8           10305    7.65         619    3.13         316113   8.51
16            4974    15.85        469    4.13         131622   20.44

Table 2. The Number of Flips by Fair and Unfair Clause Selection

Selection Method    WSAT/SKC    WSAT/G
fair                20050.6     161333.5
unfair              18288.7     172277.4

Fig. 4. The Details of the Clause Evaluator
[Circuit diagram: six Variable Number Arrays addressed by #variable[3:0] and #variable[7:4], each followed by a register; three And gates feeding three Xor gates and three Literal Value Tables (entries for thread0 to thread3); Or gates producing the outputs to the Score Adder (through the WSAT/G or WSAT/SKC selection) and to the Clause Selector; "init. data" inputs for initialization.]
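Two details of the clause evaluator can be made concrete with a short sketch: the match test through a pair of 16-entry variable number arrays, and the shape of the score adder tree. The helper names `make_arrays`, `matches` and `adder_tree` are hypothetical, and the one-hot array contents are an assumption consistent with the text (a match requires the outputs of both arrays to be true); the adder tree sketch further assumes that the number of clauses is a power of two.

```python
def make_arrays(var_number):
    """Build the two 16 x 1-bit arrays for one 8-bit variable number."""
    low = [0] * 16
    high = [0] * 16
    low[var_number & 0xF] = 1          # addressed by #variable[3:0]
    high[(var_number >> 4) & 0xF] = 1  # addressed by #variable[7:4]
    return low, high


def matches(arrays, broadcast):
    """True when the broadcast variable number is the stored one (AND of outputs)."""
    low, high = arrays
    return bool(low[broadcast & 0xF] and high[(broadcast >> 4) & 0xF])


def adder_tree(n_c):
    """List (number of adders, output width in bits) for each level of the tree."""
    levels = []
    count, width = n_c // 2, 2
    while count >= 1:
        levels.append((count, width))
        count //= 2
        width += 1
    return levels


if __name__ == "__main__":
    arrays = make_arrays(0xA7)
    print([b for b in range(256) if matches(arrays, b)])
    print(adder_tree(1024)[:3])
```

Two 16-bit arrays replace one 256-entry lookup per literal, which is one way to see why the evaluator size depends so weakly on the number of variables.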

In the clause selector, one of the unsatisfied clauses is selected at random. The problem here is that it is very difficult to select one unsatisfied clause with a fair selection strategy (all unsatisfied clauses are selected with equal probability) on a pipelined circuit. In our implementation, the unsatisfied clause is selected by selectors connected as a binary tree. Suppose that $a$ clauses in the first half (clauses 0-511 when $N_c$ is 1024) are not satisfied, and $b$ clauses in the second half (clauses 512-1023) are not satisfied. Then the root selector takes its clause from the first half with probability 1/2 and from the second half with probability 1/2, even if $a \neq b$. Table 2 compares the number of flips needed to find a solution under a fair selection and under our unfair selection by simulation. The data in Table 2 are the average of 50 runs over all instances in the uf225-960 benchmark problems with multi-thread execution. As shown in Table 2, the effect of the unfair selection can be ignored.

Fig. 5. The Details of the Management of the Current Position
[Diagram: Current Position register, Counter (0, 1, 2), Thread Number, Address Generator (table entries: 1 2 3 / 0 2 3 / 0 1 3 / 0 1 2), Read Address and Write Address of the Literal Value Table.]

4.2. Data Downloading

The data downloading time for each instance is especially important when the problem is not large, or when the instance can be solved easily, because this time must be counted as part of the execution time of a hardware solver. In our approach, we need to update four kinds of tables for each instance.

1. one clause table
2. $N_c \times 3 \times 2$ variable number arrays
3. $N_c \times 3$ literal value tables
4. one variable table ($N_v \times 4$ bits)

The size of the clause table is $N_c \times 3 \times \log_2 N_v$ bits (3072 bytes when $N_c$ is 1024 and $N_v$ is 256). These data are downloaded from the host computer. In order to simplify the downloading to the clause table, we used one word (4 bytes) for each clause (8 bits are not used here). The clause table consists of four block RAMs (when $N_v$ is 256). Their read ports and write ports are configured with different aspect ratios: the read-port configuration allows up to 2048 clauses to be processed, while the write ports (of which only 8 bits each are used) store the data in $N_c$ clock cycles. This configuration can be changed for larger problems.

The size of the variable table is $N_v \times 4$ (the number of threads) bits (1024 bits when $N_v$ is 256). These data are downloaded in parallel with the data for the clause table, using one of the unused 8 bits.

We need to initialize the values in the literal value tables according to the values in the variable table. These values are also downloaded from the host computer. In order to minimize the data downloading time, the literal value tables are divided into 32 groups (the data bit width for downloading), and the tables in each group are connected like shift registers. We need to download $N_c \times 3 \times 4$ (the number of threads) $/ 32$ words (384 words when $N_c$ is 1024).

If the data for the variable number arrays were downloaded from the host computer, we would need to download $N_c \times 3 \times 2 \times 16$ bits, which amounts to 3072 words when $N_c$ is 1024. Therefore, we designed a unit that generates the data for the variable number arrays from the data for the clause table. Figure 6 shows the outline of the unit. In Figure 6, when the data for one clause (three variable numbers) are downloaded to the clause table, one bit in each of six single-port distributed RAMs (1b × 16) on plane-A is set, using the variable numbers as addresses (the lower half and the upper half of each variable number are used as addresses). This operation is repeated 32 times (thus 192 distributed RAMs are set). Then, the data in the 192 distributed RAMs are read out (and sent to the variable number arrays) in parallel, and the data which are read out are cleared. This read/clear cycle (which takes 2 clock cycles) is repeated 16 times in 32 clock cycles. The variable number arrays are divided into 192 groups, and connected as shift registers. It also takes 32 clock cycles to shift 16 bits of data, because the variable number arrays are single-port distributed RAMs. At the same time, the 192 distributed RAMs on plane-B are set using the variable numbers of the next 32 clauses. Then, the roles of plane-A and plane-B are swapped. With this unit, the variable number arrays are initialized in parallel with the download of the clause table. Thus, all initialization finishes in about 1024 + 384 clock cycles when $N_c$ is 1024 (the figure given in Section 4).

Fig. 6. A Unit for Initializing Variable Number Arrays
[Diagram: two planes of distributed RAMs, Plane-A and Plane-B, addressed by the variable numbers V0, V1 and V2 of each downloaded clause.]

5. RESULTS

We implemented a solver for 256 variables and 1024 clauses on an XC2V6000 on an ADM-XRC-II board by Alpha Data. The circuit runs at 66MHz, and it occupies 45% of the slices and five block RAMs. A solver for up to 256 variables and 2048 clauses (or 2048 variables and 1792 clauses) can be implemented on one XC2V6000.

Table 3 compares the performance with a software program (WalkSAT Version 35 [8]) using some of the uf225-960 benchmark problems (225 variables and 960 clauses). In this comparison, WSAT/SKC is used, because it can find solutions faster than WSAT/G. Table 3 shows the total number of flips needed to find a solution, the average flip rate (K flips per second (Kfps)) and the speedup compared with the software program on an AMD Athlon MP 2200+ processor. The data in Table 3 are the average of 50 runs.

In our solver, it takes about 2.7 msec to generate the download data for each instance on the AMD Athlon MP 2200+ processor, and about 2.8 msec to download the data to the FPGA and store them in the tables in the circuit. In our system, data are received directly by the FPGA. Therefore, it seems that most of this time is spent setting up the DMA transfer on our system, because the data transfer rate is about 191MB/sec on the 66MHz 32b PCI bus.
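The download figures quoted above can be checked with a little arithmetic. The sketch below uses only numbers stated in the text (32-bit words, four threads, 66MHz clock, 191MB/sec transfer rate); the function names are hypothetical, and treating 1MB as 10^6 bytes is an assumption.

```python
WORD_BITS = 32   # data bit width for downloading
THREADS = 4      # threads in the multi-thread execution


def clause_table_words(n_c):
    # One 4-byte word per clause.
    return n_c


def literal_value_words(n_c):
    # 3 literals per clause, one initial value per thread, packed into words.
    return n_c * 3 * THREADS // WORD_BITS


def variable_array_words_if_downloaded(n_c):
    # 3 x 2 arrays of 16 bits per clause, if sent from the host instead of
    # being generated on the FPGA.
    return n_c * 3 * 2 * 16 // WORD_BITS


def download_summary(n_c, clock_hz=66e6, bytes_per_sec=191e6):
    words = clause_table_words(n_c) + literal_value_words(n_c)
    return {
        "words": words,
        "init_time_us": words / clock_hz * 1e6,           # one word per cycle
        "transfer_time_us": words * 4 / bytes_per_sec * 1e6,
    }


if __name__ == "__main__":
    print(literal_value_words(1024))
    print(variable_array_words_if_downloaded(1024))
    print(download_summary(1024))
```

Under these assumptions the raw transfer and the on-chip initialization each take tens of microseconds for 1024 clauses, far below the measured 2.8 msec, which is consistent with the observation that DMA setup dominates the download time.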
We think that this time can be improved by modifying the driver program.

As shown in Table 3, our solver shows good results for problems that require a larger number of flips, such as uf225-032 and uf225-039, though the speedup for easier problems is not good, because of the time needed to generate and download the instance-specific data into the circuit, and because the multi-thread execution is less effective for them.

Table 3. Performance Comparison

             Software          Our FPGA Solver
                               without multi-thread         with multi-thread
Instance     total    Kfps     total    Kfps   Speedup      total   Kfps   Speedup
uf225-087    1691     839      1936     331    0.37         814     576    0.38
uf225-026    2231     685      2024     345    0.57         842     596    0.59
uf225-028    10922    780      11723    1536   1.83         3801    2456   2.26
uf225-091    16347    761      14962    1820   2.62         5010    3126   3.35
uf225-032    1027208  776      1330132  5378   5.35         280742  19860  23.4
uf225-039    2726196  770      2690011  5439   7.16         600537  20945  30.9

The average flip rate for easier problems is worse than the average flip rate reported in [6]. The reason is as follows. In our solver with multi-thread execution and the WSAT/SKC algorithm, the total number of flips needed to find a solution (namely the total execution time) is much smaller than in [6]. Therefore, the time for generating and downloading the data for each instance (our time is a bit faster than that of [6]) occupies most of the execution time for easier problems. As a result, the average flip rate becomes worse as the solver finds a solution with fewer flips, namely, as the solver achieves higher performance.

In our solver, all clauses are evaluated in parallel for each flip of a variable, because this makes the control of the solver (especially the control of the pipeline) much easier, but in software, only the clauses that include the flipped variable are re-evaluated. Therefore, the true parallelism is not the number of clauses. For example, the average number of clauses evaluated for each flip of a variable (including temporary flipping) for uf225-039 in software is about 12, which can easily be estimated from the number of clauses and variables ($960 \times 3 / 225 \approx 12.8$). This is the reason for the low performance gain relative to the very high parallelism of the FPGA solver.

6. CONCLUSIONS

In this paper, we showed that our FPGA solver for WSAT algorithms achieves better performance than software and previous hardware WSAT solvers. First, we could solve larger problems than previous works, because the units that evaluate clauses are very small in our solver. Our solver, implemented on an XC2V6000 for problems of 256 variables and 1024 clauses, used 45% of the slices and five block RAMs. The size of our circuit is almost constant with respect to the number of variables (up to 4096), and can be considered almost proportional to the number of clauses. Therefore, with larger FPGAs, we can solve larger problems more efficiently than previous works.

Second, the multi-thread execution, introduced to fully utilize all pipeline stages, showed good results. With four-thread execution, we could achieve an almost fourfold speedup for large problems.

As for the data downloading time, the time for setting up the DMA transfer is large, and the effectiveness of the unit for fast data downloading is not clear. However, because of the unit, we do not need to prepare the data for the variable number arrays on the host computer, and we could reduce the time needed to prepare the download data.

The true parallelism in SAT problems is not the total number of clauses, but the number of clauses that include the variable to be flipped. Therefore, we need to develop other kinds of hardware solvers that re-evaluate only those clauses, in order to solve much larger problems under the limited hardware resources of FPGAs.

7. REFERENCES

[1] B. Selman, H. A. Kautz, and B. Cohen, "Noise Strategies for Improving Local Search", AAAI-94, pp. 337-343.
[2] D. McAllester, B. Selman, and H. Kautz, "Evidence for Invariants in Local Search", AAAI-97, pp. 321-326.
[3] I. Skliarova and A. B. Ferrari, "Reconfigurable Hardware SAT solver: A survey of Systems", IEEE Transaction on Computers, Vol.53, No.11, pp. 1449-1461, 2004.
[4] , W. H. Yung, Y. W. Seung, K. H. Lee and P. H. W. Leong, "A Runtime Reconfigurable Implementation of GSAT Algorithm", FPL99, pp. 526-531.
[5] P. H. W. Leong, C. W. Sham, W. C. Wong, H. Y. Wong, Y. S. Yuen and M. P. Leong, "A Bitstream reconfigurable FPGA implementation of the WSAT algorithm", IEEE Transaction on Very Large Scale Integration Systems 9(1), pp. 197-200, 2001.
[6] R. Yap, S. Wang and M. Henz, "Real-time Reconfigurable Hardware WSAT Variants", FPL03, pp. 488-496, 2003.
[7] http://www.intellektik.informatik.tu-darmstadt.de/SATLIB/benchm.html
[8] http://www.intellektik.informatik.tu-darmstadt.de/SATLIB/solvers.html
