Microprocessor architecture and instruction execution
■ CISC, RISC and post-RISC architectures
■ Instruction Set Architecture and microarchitecture
■ Instruction encoding and machine instructions
■ Pipelined and superscalar instruction execution
■ Hazards and dependences
■ Branch prediction
■ Out-of-order instruction execution
■ 32- and 64-bit architectures
■ Multicore architectures and hyperthreading
■ Processor architectures for embedded systems
CISC, RISC and post-RISC architectures
■ CISC – Complex Instruction Set Computer
– large instruction set
– instructions can perform very complex operations, powerful assembly language
– variable instruction formats
– large number of addressing modes
– few registers
– machine instructions implemented with microcode
■ RISC – Reduced Instruction Set Computer
– relatively few instructions
– simple addressing modes, only load/store instructions access memory
– uniform instruction length
– many registers
– no microcode
– pipelined instruction execution
■ Modern processors have developed further from the basic ideas behind RISC architecture
Post-RISC architecture ■ Modern processors have developed further from the basic ideas behind RISC architecture – exploit more instruction level parallelism
■ Characteristics:
– parallel instruction execution (superscalar)
– deep pipeline (superpipelined)
– advanced branch prediction
– out-of-order instruction execution
– register renaming
– extended instruction set
Instruction Set Architecture
■ An abstract description of a processor as it is seen by an (assembly language) programmer or compiler writer
– abstract model of a processor
– defines the instructions, registers and mechanisms to access memory that the processor can use to operate on data
■ Specifies the
– registers
– machine instructions and their encoding
– memory addresses
– addressing modes
■ Examples: Intel IA-32, Intel 64, AMD-64 – defines a family of microprocessors, from the 8086 (1978) to the Intel Core i7 – all binary compatible (within certain limits)
■ Intel 64 and IA-32 Architectures Software Developer's Manuals, available at http://www.intel.com/products/processor/manuals
Microarchitecture ■ The microarchitecture of a processor defines how the ISA is implemented in hardware – defines how the functionality of the ISA is implemented – execution pipeline, functional units, memory organization, ...
■ Example (Intel processors)
– P6 microarchitecture – from Intel Pentium Pro to Pentium III
– Netburst microarchitecture – Pentium 4, Xeon
– Core microarchitecture – Core 2, Xeon
– Nehalem microarchitecture – Core i5, Core i7
■ The physical details (circuit layout, hardware construction, packaging, etc.) are an implementation of the microarchitecture
– two processors can have the same microarchitecture, but different hardware implementations
– for instance 90 nm transistor technology, 65 nm, 45 nm high-k metal gate technology or 32 nm technology
Instruction encoding
■ Assembly language instructions are encoded into numerical machine instructions by the assembler
■ Instruction formats can be of different types
– variable length
• supports a varying number of operands
• typically used in CISC architectures: PDP-11, VAX, Motorola 68000
– fixed format
• always the same number of operands
• addressing mode is specified as part of the opcode
• easy to decode, all instructions have the same form
• typically used in RISC architectures: SPARC, PowerPC, MIPS
– hybrid format
• multiple formats, depending on the operation
• used in most Intel and AMD processors: IA-32, Intel 64, AMD-64
• machine instructions are split into micro-operations before they are executed
Assembly language instructions
■ The instruction set specifies the machine instructions that the processor can execute
– expressed as assembly language instructions
■ Instructions can have 2 or 3 operands
– add a,b      a ← a + b , result overwrites a
– add c,a,b    c ← a + b , result placed in c
■ Example: the C statement c = a + b

Register–memory assembly:
load a, R1
add  b, R1
sto  R1, c

Load–store assembly:
load a, R1
load b, R2
add  R2, R1
sto  R1, c

■ Nr of memory references in an instruction can be
– 0 – load/store (RISC)
– 1 – Intel x86
– 2 or 3 – CISC architectures
■ Translated to binary machine code (opcodes) by an assembler
– machine instructions can be of different lengths
Micro-operations ■ Machine instructions are decoded into micro-operations (µops) before they are executed ■ Simple instructions generate only one µop – Example: ADD %RBX, %RAX add the register RBX to the register RAX
■ More complex instructions generate several µops – Example: ADD %R8, (MEM) add register R8 to the contents in the memory position with address MEM – may generate 3 µops: • load the value at address MEM into a register • add the value in register R8 to the register • store the register to the memory location at address MEM
■ For complex addressing modes the effective address also has to be computed ■ Micro-operations can be efficiently executed out-of-order in a pipelined fashion
Instruction pipelining
■ Instruction execution is divided into a number of stages
– instruction fetch (IF)
– instruction decode (ID)
– execute (X)
– memory access (M)
– writeback (W)

Instructions in → IF → ID → X → M → W → Results out
■ The time to move an instruction one step through the pipeline is called a machine cycle – can complete one instruction every cycle – without pipelining we could complete one instruction every 5 cycles
■ CPI – Clock Cycles Per Instruction
– the number of cycles needed to execute an instruction
– varies for different instructions
Pipelined execution
■ Instruction fetch (IF)
– the next instruction is fetched from memory at the address pointed to by the Program Counter
– increment PC so that it points to the next instruction
■ Instruction decode (ID) – decode the instruction and identify which type it is – immediate constant values are extended into 32/64 bits
■ Execution (X)
– if it is an arithmetic operation, execute it in an ALU
– if it is a load/store, the address of the operand is computed
– if it is a branch, set the PC to the destination address
■ Memory access (M)
– if it is a load, fetch the content of the address from memory
– if it is a store, write the operand to the specified address in memory
– if it is neither a load nor a store, do nothing
■ Writeback (W)
– write the result of the operation to the destination register
– if it is a branch or a store, do nothing
Pipelined instruction execution
■ All pipeline stages can execute in parallel
– Instruction Level Parallelism
– separate hardware units for each stage

Successive instructions, entering the pipeline one per clock cycle:

load a, R1
load b, R2
load c, R3
load d, R4
add  R2, R1
add  #1, R1
sub  R3, R1
mul  R4, R1

[Pipeline diagram: over clock cycles 1–11, each instruction passes through the IF, ID, X, M and W stages, offset by one cycle from its predecessor]
■ After 5 clock cycles, the pipeline is full – finishes one instruction every clock period – it takes 5 clock periods to complete one instruction
■ Pipelining increases the CPU instruction throughput
– does not reduce the time to execute an individual instruction
Throughput and latency ■ Throughput – the number of instructions a pipeline can execute per time unit
■ Latency – the number of clock cycles it takes for the pipeline to complete the execution of an individual instruction
■ Different instructions have different latency and throughput
■ Pipeline examples:
– Pentium 3 has a 14-stage pipeline
– Pentium 4 has a 20-stage pipeline
– Core 2 and Nehalem have a 14-stage pipeline
– AMD Opteron (Barcelona) has a 12-stage pipeline
Superscalar architecture ■ Increases the ability of the processor to use instruction level parallelism ■ Multiple instructions are issued every cycle – multiple pipelines or functional units operating in parallel
■ Example:
– 3 parallel pipelines, each with 5 stages
– a 3-way superscalar (3-issue) processor

[Pipeline diagram: instructions 1–9 are issued three at a time; each group of three passes through the five pipeline stages together, over clock cycles 1–7]
Pipeline hazards ■ Pipelined execution is efficient if the flow of instructions through the pipeline can proceed without being interrupted ■ Situations that prevent an instruction in the stream from executing during its clock cycle are called pipeline hazards – hazards may force the pipeline to stall – may have to stop the instruction fetch for a number of cycles, until all the resources that are needed become available – also called a pipeline bubble
■ Structural hazards – caused by resource conflicts – two instructions need the same hardware unit in the same pipeline stage
■ Data hazards – arise when an instruction depends on the result of a previous instruction, which has not completed yet
■ Control hazards – caused by branches in the instruction stream
Structural hazards ■ Caused by resource conflicts – the hardware can not simultaneously execute two instructions that need access to the same (single) functional unit – for instance, if instructions and data are fetched from the same memory port
■ Can be avoided by – duplicating functional units or access paths to memory – pipelining functional units – stalling the instruction execution for at least one cycle • creates a pipeline bubble
Data hazards
■ An instruction depends on the result of a previous instruction, which has not completed yet
– caused by dependences among the data
– different types of data hazards: read-after-write, write-after-write and write-after-read
■ Example: – the loads write the values into the register in the write-back stage – R1 will be ready in cycle 4 – R2 will be ready in cycle 5
■ The add must stall until both R1 and R2 are ready ■ Can be avoided by forwarding and register renaming
load a, R1
load b, R2
add  R1, R2

Clock cycle    0    1    2    3    4    5    6    7    8
load a, R1     IF   ID   X    M    WB
load b, R2          IF   ID   X    M    WB
add R1, R2               IF   ID   –    –    X    M    WB
Control hazards ■ Branch instructions transfer control in the program execution – may assign a new value to the PC
■ Conditional branches may be taken or not taken – a taken branch assigns the target address to the PC – a branch that is not taken (which falls through) continues at the next instruction
■ The instruction is recognized as a branch in the instruction decode phase
      ....
      jnz  L1
      add  #1, R2
      sub  R4, R3
      ....
L1:   mov  #0, R1

– can decide whether the branch will be taken or not in the execute stage
– the next instruction has to stall

■ Can be avoided by branch prediction

Clock cycle    0    1    2    3    4    5    6
jnz L1         IF   ID   X    M    WB
add #1, R2          IF   ID   X    M    WB
sub R4, R3               IF   ID   X    M    WB

– the add and sub are fetched before the branch outcome is known; if the branch is taken, they must be discarded and the fetch restarted at L1 (mov #0, R1)
Dependence ■ Pipeline hazards are caused by dependences in the code – limit the amount of instruction level parallelism that can be used
■ Can avoid hazards (and pipeline stalls) in the execution of a program by using more advanced instruction execution mechanisms
– forwarding
– register renaming
– instruction scheduling
– branch prediction
– dynamic instruction execution
■ Can also eliminate some dependences by code transformations – formulate the program in an alternative way, avoiding some dependences
Data and control dependence ■ Data dependence – data must be produced and consumed in the correct order in the program execution – Definition: two statements s and t are data dependent if and only if • both statements access the same memory location and at least one of them stores into it, and • there is a feasible run-time execution path from s to t
■ Control dependence
– determines the ordering of instructions with respect to branches
– Example:
if p1 then s1
      else s2;
• s1 and s2 are control dependent on p1
• we have to first execute p1 before we know which of s1 or s2 should be executed
Data dependence ■ Three types of data dependences – true dependence – anti-dependence – output dependence
■ Anti-dependence and output dependence are called name dependences – two instructions use the same register or memory locations, but there is no actual flow of data between the two instructions – no real dependence between data, only between the names, i.e. the registers or memory locations (variables) that are used to hold the data
True dependence
■ An instruction depends on data from a previous instruction
– the first statement stores into a location that is later read by the second statement
– can not execute the statements in the reverse order
– can not execute the statements simultaneously in the pipeline without causing a stall

x = a*2;
y = x+1;
■ Corresponds to a Read After Write (RAW) data hazard between the two instructions
Antidependence ■ Anti-dependence – the first instruction reads from a location into which the second statement stores
y = x+a;
x = b;
■ Corresponds to a Write After Read (WAR) hazard between the two instructions ■ No value is transmitted between the two statements – can be executed simultaneously if we choose another name for x in the assignment statement x=b;
■ Can be avoided by using register renaming – use different registers for the variable x in the two statements
Output dependence ■ Output dependence – two instructions write to the same register or memory location
■ Corresponds to a Write After Write (WAW) hazard between the two instructions ■ No value is transmitted between the two statements
x = x+1;
...
x = b;
– can be executed simultaneously if we choose another name for one of the references to x
■ Can be avoided by using register renaming
Control dependence
■ Control dependence determines the order of instructions with respect to branches (jumps) in the code
– if p evaluates to TRUE, the instruction s1 is executed
– if p evaluates to FALSE, the instruction s1 is not executed, but is branched over
s0;
if (p)
   then s1;
s2;
■ Instruction that are control dependent on a branch can not be moved before the branch – instructions from the then-part of an if-statement can not be executed before the branch
■ Instructions that are not control dependent on a branch can not be moved into the then-part
■ Can avoid hazards by using branch prediction
– speculative execution
Branch prediction ■ To avoid stalling the pipeline when branch instructions are executed, branch prediction is used – it is very important to have a good branch prediction mechanism, since branches are very common in most programs – Example: 20% branch instructions in SPECint92 benchmark
■ Branch prediction is only needed for conditional branches, unconditional branches are always taken – subroutine calls and goto-statements are always taken – returns from subroutines need to be predicted
■ Two types of branch prediction mechanisms:
– static prediction uses fixed rules for guessing how branches will go
– dynamic prediction collects information about how the branches have behaved in the past, and uses that to predict how the branch will go the next time
Mispredicted branches ■ When a misprediction occurs, the processor has executed instructions that should not be executed – it has to undo the effects of the falsely executed instructions
■ It is not allowed to change the state of the processor until the branch outcome is known
if (f(x)>n)
   x = 0;
else
   x = 1;
. . .
– no writeback can be done before the outcome of the branch is ready
■ The instructions that were executed because of a mispredicted branch have to be undone – flush out the mispredicted instructions from the pipeline – restart the instruction fetch from the correct branch target
■ The performance penalty of a mispredicted branch is typically as many clock cycles as the length of the pipeline
Static branch prediction ■ Fixed rules for predicting how a branch will behave – the prediction is not based on the earlier behavior of the branch – guess the outcome of the branch, and continue the execution with the predicted instruction
■ Predict as taken / not taken – the prediction is the same for all branches
■ Direction-based prediction – backward branches are taken – forward branches are not taken – success rate is about 65%
for (i=0; i<N; i++) {
   ...
}

int max(int a, int b)
{
   if (a>b) return a;
   else return b;
}
int max(int a, int b)
{
   return (a>b) ? a : b;
}
– jump tables, function pointers
■ Avoid very deep nesting of subroutines
– otherwise the Return Address Stack may overflow
– use iterative functions instead of recursive, if possible
Branch density ■ If possible, avoid code that contains too many branches – avoid complex logical expressions that generate dense conditional branches, especially if the branch bodies are small – in the AMD Opteron, more than three branches in a 16-byte code block leads to resource conflicts in the branch target buffer – causes unnecessary branch misprediction
■ Branches can be eliminated by using conditional move or conditional set instructions – it may also be possible to rewrite complex branches with assembly language code that uses conditional moves
Order of evaluation in Boolean expressions
■ C and C++ use short-circuit evaluation for compound Boolean expressions
– in a Boolean expression (a OP b), the second argument is not evaluated if the first argument alone determines the value of the expression
– if a evaluates to TRUE in an expression if (a||b), then b is not evaluated
– if a evaluates to FALSE in an expression if (a&&b), then b is not evaluated
■ If one of the expressions is known to be true more often than the other, arrange the expressions so that the evaluation is shortened – if a is known to be TRUE 60% of the time and b is TRUE 10% of the time then you should arrange them as (b&&a) and (a || b)
■ If one expression is more predictable, place that first
■ If one expression is much faster to calculate, place that first
■ If the Boolean expressions have side effects or are dependent, they can not necessarily be rearranged
Avoid unnecessary branches ■ Use if-else-if constructs to avoid branches if the cases are mutually exclusive – no need to evaluate all if-statements
■ A switch statement is even better
■ A table lookup can sometimes also be used – no branch instructions in the generated assembly code
double select(int a) {
   double result;
   if (a==0) result = 1.13;
   if (a==1) result = 2.56;
   if (a==2) result = 3.67;
   if (a==3) result = 4.16;
   if (a==4) result = 8.12;
   return result;
}

double select(int a) {
   double result;
   if (a==0) result = 1.13;
   else if (a==1) result = 2.56;
   else if (a==2) result = 3.67;
   else if (a==3) result = 4.16;
   else if (a==4) result = 8.12;
   return result;
}

double select(int a) {
   double result[5] = {1.13, 2.56, 3.67, 4.16, 8.12};
   return result[a];
}
Order of branches ■ Order branches in if- and switch-statements so that the most likely case comes first – the other cases do not have to be evaluated if the first one is TRUE
■ Use contiguously numbered case expressions, if possible
– the compiler can translate the switch-statement into a jump table
– if they are non-contiguous, use a series of if-else statements instead

switch (value) {   /* Most likely case first */
   case 0: handle_0(); break;
   case 1: handle_1(); break;
   case 2: handle_2(); break;
   case 3: handle_3(); break;
}

if (a==0) {
   /* Handle case for a==0 */
}
else if (a==8) {
   /* Handle case for a==8 */
}
else {
   /* Handle default case */
}
Loop unswitching
■ Move loop-invariant conditional constructs out of the loop
– if- or switch-statements which are independent of the loop index can be moved outside of the loop
– the loop is instead repeated in the different branches of the if- or switch-statement
– removes branch instructions from within the loop
■ Removes branch instructions, increases instruction level parallelism, improves possibilities to parallelize the loop
– but increases the amount of code

for (i=0; i<N; i++) {
   if (a>0)
      X[i] = a;
   else
      X[i] = 0;
}

if (a>0) {
   for (i=0; i<N; i++)
      X[i] = a;
}
else {
   for (i=0; i<N; i++)
      X[i] = 0;
}