CISC, RISC and post-RISC architectures

Microprocessor architecture and instruction execution

■  CISC, RISC and Post-RISC architectures
■  Instruction Set Architecture and microarchitecture
■  Instruction encoding and machine instructions
■  Pipelined and superscalar instruction execution
■  Hazards and dependences
■  Branch prediction
■  Out-of-order instruction execution
■  32- and 64-bit architectures
■  Multicore architectures and hyperthreading
■  Processor architectures for embedded systems


CISC, RISC and post-RISC architectures

■  CISC – Complex Instruction Set Computer
   –  large instruction set
   –  instructions can perform very complex operations, powerful assembly language
   –  variable instruction formats
   –  large number of addressing modes
   –  few registers
   –  machine instructions implemented with microcode

■  RISC – Reduced Instruction Set Computer
   –  relatively few instructions
   –  simple addressing modes, only load/store instructions access memory
   –  uniform instruction length
   –  many registers
   –  no microcode
   –  pipelined instruction execution

■  Modern processors have developed further from the basic ideas behind RISC architecture


Post-RISC architecture ■  Modern processors have developed further from the basic ideas behind RISC architecture –  exploit more instruction level parallelism

■  Characteristics:
   –  parallel instruction execution (superscalar)
   –  deep pipeline (superpipelined)
   –  advanced branch prediction
   –  out-of-order instruction execution
   –  register renaming
   –  extended instruction set


Instruction Set Architecture ■  An abstract description of a processor as it is seen by an (assembly language) programmer or compiler writer –  abstract model of a processor –  defines the instructions, registers and mechanisms to access memory that the processor can use to operate on data

■  Specifies the
   –  registers
   –  machine instructions and their encoding
   –  memory addresses
   –  addressing modes

■  Examples: Intel IA-32, Intel 64, AMD-64 –  defines a family of microprocessors, from the 8086 (1978) to the Intel Core i7 –  all binary compatible (within certain limits)

■  Intel 64 and IA-32 Architectures Software Developer's Manuals, available at http://www.intel.com/products/processor/manuals


Microarchitecture ■  The microarchitecture of a processor defines how the ISA is implemented in hardware –  defines how the functionality of the ISA is implemented –  execution pipeline, functional units, memory organization, ...

■  Example (Intel processors)
   –  P6 microarchitecture – from Intel Pentium Pro to Pentium III
   –  Netburst microarchitecture – Pentium 4, Xeon
   –  Core microarchitecture – Core 2, Xeon
   –  Nehalem microarchitecture – Core i5, Core i7

■  The physical details (circuit layout, hardware construction, packaging, etc.) are an implementation of the microarchitecture –  two processors can have the same microarchitecture, but different hardware implementations –  for instance 90 nm transistor technology, 65 nm, 45 nm high-k metal gate technology or 32 nm technology


Instruction encoding ■  Assembly language instructions are encoded into numerical machine instructions by the assembler ■  Instruction formats can be of different types –  variable length •  supports a varying number of operands •  typically used in CISC architectures: PDP-11, VAX, Motorola 68000

–  fixed format
   •  always the same number of operands
   •  addressing mode is specified as part of the opcode
   •  easy to decode, all instructions have the same form
   •  typically used in RISC architectures: SPARC, PowerPC, MIPS

–  hybrid format •  multiple formats, depending on the operation •  used in most Intel and AMD processors: IA-32, Intel 64, AMD-64 •  machine instructions are split into micro-operations before they are executed


Assembly language instructions

■  The instruction set specifies the machine instructions that the processor can execute
   –  expressed as assembly language instructions

■  Instructions can have 2 or 3 operands
   –  add a,b      a ← a + b , result overwrites a
   –  add c,a,b    c ← a + b , result placed in c

   C code:
      c = a + b;

   Assembly language (register – memory):
      load a, R1
      add  b, R1
      sto  R1, c

   Assembly language (load – store):
      load a, R1
      load b, R2
      add  R2, R1
      sto  R1, c

■  Nr of memory references in an instruction can be
   –  0 – load/store (RISC)
   –  1 – Intel x86
   –  2 or 3 – CISC architectures

■  Translated to binary machine code (opcodes) by an assembler
   –  machine instructions can be of different lengths

Micro-operations ■  Machine instructions are decoded into micro-operations (µops) before they are executed ■  Simple instructions generate only one µop –  Example: ADD %RBX, %RAX – adds the register RBX to the register RAX

■  More complex instructions generate several µops –  Example: ADD %R8, (MEM) – adds register R8 to the contents of the memory position with address MEM –  may generate 3 µops: •  load the value at address MEM into a register •  add the value in register R8 to the register •  store the register to the memory location at address MEM

■  For complex addressing modes the effective address also has to be computed ■  Micro-operations can be efficiently executed out-of-order in a pipelined fashion
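As a rough C-level analogy of the three µops listed above — this is not the processor's actual internal µop format, and the variable names mem, r8 and tmp are invented for the illustration:

   #include <stdio.h>

   /* C-level analogy of splitting "ADD %R8, (MEM)" into three µops.
    * mem stands for the memory location at address MEM, r8 for the
    * register R8; both values are made up for the example. */
   int main(void)
   {
       long mem = 5;        /* contents of the memory location MEM */
       long r8  = 3;        /* contents of register R8             */

       long tmp = mem;      /* µop 1: load the value at MEM into a register */
       tmp = tmp + r8;      /* µop 2: add R8 to that register               */
       mem = tmp;           /* µop 3: store the register back to MEM        */

       printf("mem = %ld\n", mem);   /* prints 8 */
       return 0;
   }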


Instruction pipelining

■  Instruction execution is divided into a number of stages
   –  instruction fetch
   –  instruction decode
   –  execute
   –  memory
   –  writeback

   Instructions in → IF → ID → X → M → W → Results out

■  The time to move an instruction one step through the pipeline is called a machine cycle –  can complete one instruction every cycle –  without pipelining we could complete one instruction every 5 cycles

■  CPI – clock Cycles Per Instruction –  the number of cycles needed to execute an instruction –  varies for different instructions


Pipelined execution ■  Instruction fetch (IF) –  the next instruction is fetched from memory at the address pointed to by the Program Counter –  increment PC so that it points to the next instruction

■  Instruction decode (ID) –  decode the instruction and identify which type it is –  immediate constant values are extended into 32/64 bits

■  Execution (X) –  if it's an arithmetic operation, execute it in an ALU –  if it's a load/store, the address of the operand is computed –  if it's a branch, set the PC to the destination address

■  Memory access (M) –  if it's a load, fetch the content of the address from memory –  if it's a store, write the operand to the specified address in memory –  if it's neither a load nor a store, do nothing

■  Writeback (W) –  write the result of the operation to the destination register –  if it's a branch or a store, do nothing


Pipelined instruction execution

■  All pipeline stages can execute in parallel
   –  Instruction Level Parallelism
   –  separate hardware units for each stage

   Successive instructions:
      load a, R1
      load b, R2
      load c, R3
      load d, R4
      add  R2, R1
      add  #1, R1
      sub  R3, R1
      mul  R4, R1

   [Pipeline diagram: the eight instructions enter the IF–ID–X–M–W pipeline one cycle apart, filling the pipeline over clock cycles 1–11]

■  After 5 clock cycles, the pipeline is full –  finishes one instruction every clock period –  it takes 5 clock periods to complete one instruction

■  Pipelining increases the CPU instruction throughput –  does not reduce the time to execute an individual instruction
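To make the cycle counts on this slide concrete, here is a minimal sketch of the arithmetic: with s pipeline stages and n independent instructions, pipelined execution takes about s + (n − 1) cycles instead of s · n. The values 5 and 8 simply match the example instruction sequence above, and an ideal pipeline with no hazards is assumed.

   #include <stdio.h>

   /* Sketch: cycle counts for an ideal pipeline with no hazards.
    * stages = 5 and n = 8 match the example on the previous slide. */
   int main(void)
   {
       int stages = 5;   /* IF, ID, X, M, W                    */
       int n      = 8;   /* number of independent instructions */

       int pipelined   = stages + (n - 1);  /* fill the pipe, then one per cycle */
       int unpipelined = stages * n;        /* each instruction runs alone       */

       printf("pipelined:   %d cycles (CPI ~ %.2f)\n",
              pipelined, (double)pipelined / n);
       printf("unpipelined: %d cycles (CPI = %d)\n", unpipelined, stages);
       /* the latency of one instruction is still 'stages' cycles in both cases */
       return 0;
   }

For 8 instructions this gives 12 cycles instead of 40; as n grows, the CPI approaches 1 while the latency of each instruction stays at 5 cycles.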

Throughput and latency ■  Throughput –  the number of instructions a pipeline can execute per time unit

■  Latency –  the number of clock cycles it takes for the pipeline to complete the execution of an individual instruction

■  Different instructions have different latency and throughput

■  Pipeline examples:
   –  Pentium 3 has a 14-stage pipeline
   –  Pentium 4 has a 20-stage pipeline
   –  Core 2 and Nehalem have a 14-stage pipeline
   –  AMD Opteron (Barcelona) has a 12-stage pipeline


Superscalar architecture ■  Increases the ability of the processor to use instruction level parallelism ■  Multiple instructions are issued every cycle –  multiple pipelines or functional units operating in parallel

■  Example:
   –  3 parallel pipelines each with 5 stages
   –  3-way superscalar processor
   –  3-issue processor

   [Figure: nine successive instructions issued three per clock cycle into three parallel 5-stage pipelines, over clock cycles 1–7]
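The same back-of-the-envelope arithmetic extended to the 3-way example above: an ideal w-wide pipeline needs roughly stages + ⌈n/w⌉ − 1 cycles. This is a sketch for the ideal case only, assuming all nine instructions are independent and every issue slot can be filled.

   #include <stdio.h>

   /* Sketch: ideal cycle count for a w-wide superscalar pipeline.
    * Real processors rarely reach this bound because of hazards and
    * issue restrictions. */
   static int ideal_cycles(int stages, int n, int width)
   {
       int groups = (n + width - 1) / width;   /* ceil(n / width) */
       return stages + groups - 1;
   }

   int main(void)
   {
       /* 9 instructions, 5 stages: the figure above with 1-issue and 3-issue */
       printf("scalar:      %d cycles\n", ideal_cycles(5, 9, 1));   /* 13 */
       printf("3-way issue: %d cycles\n", ideal_cycles(5, 9, 3));   /*  7 */
       return 0;
   }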


Pipeline hazards ■  Pipelined execution is efficient if the flow of instructions through the pipeline can proceed without being interrupted ■  Situations that prevent an instruction in the stream from executing during its clock cycle are called pipeline hazards –  hazards may force the pipeline to stall –  may have to stop the instruction fetch for a number of cycles, until all the resources that are needed become available –  also called a pipeline bubble

■  Structural hazards –  caused by resource conflicts –  two instructions need the same hardware unit in the same pipeline stage

■  Data hazards –  arise when an instruction depends on the result of a previous instruction, which has not completed yet

■  Control hazards –  caused by branches in the instruction stream


Structural hazards ■  Caused by resource conflicts –  the hardware can not simultaneously execute two instructions that need access to the same (single) functional unit –  for instance, if instructions and data are fetched from the same memory port

■  Can be avoided by –  duplicating functional units or access paths to memory –  pipelining functional units –  stalling the instruction execution for at least one cycle •  creates a pipeline bubble


Data hazards ■  An instruction depends on the result of a previous instruction, which has not completed yet –  caused by dependences among the data –  different types of data hazards: read-after-write, write-after-write and write-after-read

■  Example: –  the loads write the values into the register in the write-back stage –  R1 will be ready in cycle 4 –  R2 will be ready in cycle 5

■  The add must stall until both R1 and R2 are ready ■  Can be avoided by forwarding and register renaming

                Clock cycle:  0    1    2    3    4    5    6    7    8
   load a, R1                 IF   ID   X    M    WB
   load b, R2                      IF   ID   X    M    WB
   add  R1, R2                          IF   ID   --   --   X    M    WB
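The stall in this timing table can be reproduced with a tiny pipeline model. This is a hedged sketch under simplifying assumptions — a 5-stage in-order pipeline, no forwarding, and a result written in WB becoming readable in X one cycle later; the instruction encoding and register numbering are invented for the example.

   #include <stdio.h>

   #define NREGS 8

   struct instr {
       const char *text;
       int dst;             /* destination register, -1 if none */
       int src1, src2;      /* source registers, -1 if unused   */
   };

   int main(void)
   {
       /* load a,R1 ; load b,R2 ; add R1,R2 (reads R1 and R2; the destination
        * is assumed to be R2 here, but the stall comes from the two source
        * operands either way) */
       struct instr prog[] = {
           { "load a, R1",  1, -1, -1 },
           { "load b, R2",  2, -1, -1 },
           { "add  R1, R2", 2,  1,  2 },
       };
       int n = (int)(sizeof prog / sizeof prog[0]);

       int ready[NREGS];                 /* cycle in which a register is written */
       for (int r = 0; r < NREGS; r++)
           ready[r] = -1;                /* -1: value available from the start   */

       int prev_x = -1;                  /* X cycle of the previous instruction  */
       for (int i = 0; i < n; i++) {
           int x = i + 2;                           /* no-stall X cycle (IF = i) */
           if (prev_x + 1 > x) x = prev_x + 1;      /* keep program order        */
           if (prog[i].src1 >= 0 && ready[prog[i].src1] + 1 > x)
               x = ready[prog[i].src1] + 1;         /* wait for operand 1        */
           if (prog[i].src2 >= 0 && ready[prog[i].src2] + 1 > x)
               x = ready[prog[i].src2] + 1;         /* wait for operand 2        */

           int wb = x + 2;                          /* X -> M -> WB              */
           if (prog[i].dst >= 0) ready[prog[i].dst] = wb;
           printf("%-12s  X in cycle %d, WB in cycle %d\n", prog[i].text, x, wb);
           prev_x = x;
       }
       return 0;
   }

Running it reproduces the table: the add executes in cycle 6 and writes back in cycle 8, two cycles later than it would without the data hazard.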


Control hazards ■  Branch instructions transfer control in the program execution –  may assign a new value to the PC

■  Conditional branches may be taken or not taken –  a taken branch assigns the target address to the PC –  a branch that is not taken (which falls through) continues at the next instruction

■  The instruction is recognized as a branch in the instruction decode phase

         ....
         jnz  L1
         add  #1, R2
         sub  R4, R3
         ....
   L1:   mov  #0, R1

   –  can decide whether the branch will be taken or not in the execute stage
   –  the next instruction has to stall

■  Can be avoided by branch prediction

   [Pipeline diagram: jnz L1 proceeds through IF ID X M WB; the following instructions (add #1,R2 and sub R4,R3) are fetched but must stall until the branch outcome is known in the execute stage]


Dependence ■  Pipeline hazards are caused by dependences in the code –  limit the amount of instruction level parallelism that can be used

■  Can avoid hazards (and pipeline stalls) in the execution of a program by using more advanced instruction execution mechanisms
   –  forwarding
   –  register renaming
   –  instruction scheduling
   –  branch prediction
   –  dynamic instruction execution

■  Can also eliminate some dependences by code transformations –  formulate the program in an alternative way, avoiding some dependences


Data and control dependence ■  Data dependence –  data must be produced and consumed in the correct order in the program execution –  Definition: two statements s and t are data dependent if and only if •  both statements access the same memory location and at least one of them stores into it, and •  there is a feasible run-time execution path from s to t

■  Control dependence
   –  determines the ordering of instructions with respect to branches
   –  Example:
         if p1
            then s1
            else s2;
      •  s1 and s2 are control dependent on p1
      •  we have to first execute p1 before we know which of s1 or s2 should be executed


Data dependence ■  Three types of data dependences –  true dependence –  anti-dependence –  output dependence

■  Anti-dependence and output dependence are called name dependences –  two instructions use the same register or memory locations, but there is no actual flow of data between the two instructions –  no real dependence between data, only between the names, i.e. the registers or memory locations (variables) that are used to hold the data


True dependence

■  An instruction depends on data from a previous instruction
   –  the first statement stores into a location that is later read by the second statement
   –  can not execute the statements in the reverse order
   –  can not execute the statements simultaneously in the pipeline without causing a stall

      x = a*2;
      y = x+1;

■  Corresponds to a Read After Write (RAW) data hazard between the two instructions


Antidependence

■  Anti-dependence
   –  the first instruction reads from a location into which the second statement stores

      y = x+a;
      x = b;

■  Corresponds to a Write After Read (WAR) hazard between the two instructions ■  No value is transmitted between the two statements –  can be executed simultaneously if we choose another name for x in the assignment statement x=b;

■  Can be avoided by using register renaming –  use different registers for the variable x in the two statements
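A minimal C-level illustration of the renaming idea — the hardware renames physical registers, but variable names show the same effect; the names x_old and x_new and all values are invented for the example.

   #include <stdio.h>

   int main(void)
   {
       int a = 2, b = 7;

       /* Anti-dependence (WAR): statement 2 overwrites x, which
        * statement 1 still has to read, so their order is fixed. */
       int x = 1, y;
       y = x + a;          /* reads x  */
       x = b;              /* writes x */

       /* After renaming: the new value gets its own name (x_new), so the
        * two statements share no storage and could run simultaneously or
        * in either order. */
       int x_old = 1, x_new, y2;
       y2    = x_old + a;  /* reads the old name  */
       x_new = b;          /* writes the new name */

       printf("%d %d  %d %d\n", y, x, y2, x_new);   /* 3 7  3 7 */
       return 0;
   }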


Output dependence

■  Output dependence
   –  two instructions write to the same register or memory location

      x = x+1;
      ...
      x = b;

■  Corresponds to a Write After Write (WAW) hazard between the two instructions

■  No value is transmitted between the two statements

–  can be executed simultaneously if we choose another name for one of the references to x

■  Can be avoided by using register renaming


Control dependence

■  Control dependence determines the order of instructions with respect to branches (jumps) in the code
   –  if p evaluates to TRUE, the instruction s1 is executed
   –  if p evaluates to FALSE, the instruction s1 is not executed, but is branched over

      s0;
      if (p)
         then s1;
      s2;

■  Instructions that are control dependent on a branch can not be moved before the branch
   –  instructions from the then-part of an if-statement can not be executed before the branch

■  Instructions that are not control dependent on a branch can not be moved into the then-part

■  Can avoid hazards by using branch prediction
   –  speculative execution


Branch prediction ■  To avoid stalling the pipeline when branch instructions are executed, branch prediction is used –  it is very important to have a good branch prediction mechanism, since branches are very common in most programs –  Example: 20% branch instructions in SPECint92 benchmark

■  Branch prediction is only needed for conditional branches, unconditional branches are always taken –  subroutine calls and goto-statements are always taken –  returns from subroutines need to be predicted

■  Two types of branch prediction mechanisms: –  static uses fixed rules for guessing how branches will go –  dynamic collects information about how the branches have behaved in the past, and uses that to predict how the branch will go the next time
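As an illustration of the dynamic approach, the sketch below implements one common scheme, a 2-bit saturating counter per branch; the slides do not specify which dynamic predictor is meant, so this scheme and the outcome pattern are assumptions chosen only to show the idea of learning from past behaviour.

   #include <stdio.h>

   /* 2-bit saturating counter for one branch: states 0-1 predict
    * "not taken", states 2-3 predict "taken"; each real outcome moves
    * the counter one step towards that outcome. */
   static int state = 2;                 /* start in "weakly taken"   */

   static int predict(void)              /* 1 = predict taken         */
   {
       return state >= 2;
   }

   static void update(int taken)         /* train on the real outcome */
   {
       if (taken  && state < 3) state++;
       if (!taken && state > 0) state--;
   }

   int main(void)
   {
       /* outcome pattern of a loop branch: taken 7 times, then not taken */
       int outcomes[] = { 1, 1, 1, 1, 1, 1, 1, 0 };
       int n = (int)(sizeof outcomes / sizeof outcomes[0]);
       int hits = 0;

       for (int i = 0; i < n; i++) {
           if (predict() == outcomes[i]) hits++;
           update(outcomes[i]);
       }
       printf("correct predictions: %d of %d\n", hits, n);   /* 7 of 8 */
       return 0;
   }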


Mispredicted branches ■  When a misprediction occurs, the processor has executed instructions that should not be executed –  it has to undo the effects of the falsely executed instructions

■  It is not allowed to change the state of the processor until the branch outcome is known

      if (f(x) > n)
         x = 0;
      else
         x = 1;
      . . .

–  no writeback can be done before the outcome of the branch is ready

■  The instructions that were executed because of a mispredicted branch have to be undone –  flush out the mispredicted instructions from the pipeline –  restart the instruction fetch from the correct branch target

■  The performance penalty of a mispredicted branch is typically as many clock cycles as the length of the pipeline
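A back-of-the-envelope sketch of how this penalty shows up in the average CPI, using the 20% branch frequency quoted on the previous slide; the 10% misprediction rate and the 14-cycle penalty are assumed example values, not figures from the slides.

   #include <stdio.h>

   /* Sketch: effect of branch mispredictions on average CPI. */
   int main(void)
   {
       double base_cpi    = 1.0;    /* ideal pipelined CPI               */
       double branch_freq = 0.20;   /* fraction of branch instructions   */
       double mispredict  = 0.10;   /* assumed misprediction rate        */
       double penalty     = 14.0;   /* assumed pipeline length in cycles */

       double cpi = base_cpi + branch_freq * mispredict * penalty;
       printf("average CPI with mispredictions: %.2f\n", cpi);   /* 1.28 */
       return 0;
   }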


Static branch prediction ■  Fixed rules for predicting how a branch will behave –  the prediction is not based on the earlier behavior of the branch –  guess the outcome of the branch, and continue the execution with the predicted instruction

■  Predict as taken / not taken –  the prediction is the same for all branches

■  Direction-based prediction –  backward branches are taken –  forward branches are not taken –  success rate is about 65%

      for (i=0; i<N; i++) {
         ...
      }

      int max(int a, int b)
      {
         if (a>b) return a;
         else return b;
      }

      int max(int a, int b)
      {
         return (a>b) ? a : b;
      }

   –  jump tables, function pointers

■  Avoid very deep nesting of subroutines
   –  otherwise the Return Address Stack may overflow
   –  use iterative functions instead of recursive, if possible
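A hedged sketch of the last point — replacing a recursive function with an iterative one so that the call depth (and the pressure on the return address stack) stays constant; sum_to is an invented example function.

   #include <stdio.h>

   /* Recursive version: one call, and one return address, per element. */
   static long sum_to_recursive(long n)
   {
       return (n == 0) ? 0 : n + sum_to_recursive(n - 1);
   }

   /* Iterative version: constant call depth, and a simple loop branch
    * that a dynamic branch predictor handles well. */
   static long sum_to_iterative(long n)
   {
       long sum = 0;
       for (long i = 1; i <= n; i++)
           sum += i;
       return sum;
   }

   int main(void)
   {
       printf("%ld %ld\n", sum_to_recursive(100), sum_to_iterative(100));   /* 5050 5050 */
       return 0;
   }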


Branch density ■  If possible, avoid code that contains too many branches –  avoid complex logical expressions that generate dense conditional branches, especially if the branch bodies are small –  in the AMD Opteron, more than three branches in a 16-byte code block leads to resource conflicts in the branch target buffer –  causes unnecessary branch misprediction

■  Branches can be eliminated by using conditional move or conditional set instructions –  it may also be possible to rewrite complex branches with assembly language code that uses conditional moves


Order of evaluation in Boolean expressions ■  C and C++ use short-circuit evaluation for compound Boolean expressions –  in a Boolean expression (a OP b), the second argument is not evaluated if the first argument alone determines the value of the expression –  if a evaluates to TRUE in an expression if (a||b), then b is not evaluated –  if a evaluates to FALSE in an expression if (a&&b), then b is not evaluated

■  If one of the expressions is known to be true more often than the other, arrange the expressions so that the evaluation is shortened –  if a is known to be TRUE 60% of the time and b is TRUE 10% of the time then you should arrange them as (b&&a) and (a || b)

■  If one expression is more predictable, place that first ■  If one expression is much faster to calculate, place that first ■  If the Boolean expressions have side effects or are dependent, they can not necessarily be rearranged
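A small sketch of the reordering rule above; a_test() and b_test() are invented stand-ins for the two sub-expressions, tuned to the slide's example probabilities (a TRUE about 60% of the time, b about 10%). As noted above, the rearrangement is only valid when the operands are independent and have no side effects.

   #include <stdio.h>
   #include <stdlib.h>

   /* Invented stand-ins for the two sub-expressions on the slide. */
   static int a_test(void) { return rand() % 100 < 60; }   /* TRUE ~60% */
   static int b_test(void) { return rand() % 100 < 10; }   /* TRUE ~10% */

   int main(void)
   {
       int and_hits = 0, or_hits = 0;

       for (int i = 0; i < 1000; i++) {
           /* &&: put the operand that is most often FALSE first, so the
            * second operand is usually not evaluated at all. */
           if (b_test() && a_test())
               and_hits++;

           /* ||: put the operand that is most often TRUE first. */
           if (a_test() || b_test())
               or_hits++;
       }
       printf("&& true %d times, || true %d times\n", and_hits, or_hits);
       return 0;
   }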


Avoid unnecessary branches ■  Use if-else-if constructs to avoid branches if the cases are mutually exclusive –  no need to evaluate all if-statements

■  A switch statement is even better

■  A table lookup can sometimes also be used –  no branch instructions in the generated assembly code

      /* series of independent if-statements */
      double select(int a) {
         double result;
         if (a==0) result = 1.13;
         if (a==1) result = 2.56;
         if (a==2) result = 3.67;
         if (a==3) result = 4.16;
         if (a==4) result = 8.12;
         return result;
      }

      /* if-else-if: mutually exclusive cases */
      double select(int a) {
         double result;
         if (a==0) result = 1.13;
         else if (a==1) result = 2.56;
         else if (a==2) result = 3.67;
         else if (a==3) result = 4.16;
         else if (a==4) result = 8.12;
         return result;
      }

      /* table lookup: no branch instructions */
      double select(int a) {
         double result[5] = {1.13, 2.56, 3.67, 4.16, 8.12};
         return result[a];
      }

Order of branches ■  Order branches in if- and switch-statements so that the most likely case comes first –  the other cases do not have to be evaluated if the first one is TRUE

■  Use contiguously numbered case expressions, if possible
   –  the compiler can translate the switch-statement into a jump table
   –  if they are non-contiguous, use a series of if-else statements instead

      switch (value) {   /* Most likely case first */
         case 0: handle_0(); break;
         case 1: handle_1(); break;
         case 2: handle_2(); break;
         case 3: handle_3(); break;
      }

      if (a==0) {
         /* Handle case for a==0 */
      }
      else if (a==8) {
         /* Handle case for a==8 */
      }
      else {
         /* Handle default case */
      }


Loop unswitching ■  Move loop-invariant conditional constructs out of the loop –  if- or switch-statements which are independent of the loop index can be moved outside of the loop –  the loop is instead repeated in the different branches of the if- or switch-statement –  removes branch instructions from within the loop

■  Removes branch instructions, increases instruction level parallelism, improves possibilities to parallelize the loop
   –  but increases the amount of code

      for (i=0; i<N; i++) {
         if (a>0)
            X[i] = a;
         else
            X[i] = 0;
      }

      if (a>0) {
         for (i=0; i<N; i++)
            X[i] = a;
      }
      else {
         for (i=0; i<N; i++)
            X[i] = 0;
      }