Read Appendix C
C2: Pipeline Hazards
A hazard prevents the next instruction from executing during its designated clock cycle.
Hazard Classifications
- structural hazard: the hardware cannot support all combinations of instructions in overlapped execution.
- data hazard: instruction needs data from a previous instruction before it is available.
- control hazard: branch changes the PC after a later instruction has been fetched.
A hazard may require that the pipeline stalls until the hazard can be cleared.
For now, when a stall occurs:
- all instructions issued later will also stall
- all instructions issued earlier will continue so that the hazard can be cleared
Performance with stalls
We will compare an unpipelined machine, in which instructions take several cycles, to a pipelined machine with the same clock rate.
- speedup = (Average instruction time unpipelined) / (Average instruction time pipelined) = (CPI unpipelined) / (CPI pipelined)
- with no hazards, CPI pipelined = 1.
- with hazards, CPI pipelined = 1 + stall cycles per instruction
- speedup = (CPI unpipelined) / (1 + stall cycles per instruction)
- in the case in which all instructions on the unpipelined machine take the same time and the pipeline is completely balanced with no overhead:
CPI unpipelined = pipeline depth and
speedup = (pipeline depth) / (1 + stall cycles per instruction)
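As a quick check of the formula above, here is a minimal Python sketch for the balanced, no-overhead case. The function name `pipeline_speedup` and the stall rate of 0.25 cycles per instruction are illustrative assumptions, not values from these notes.

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup over a balanced unpipelined machine with the same clock rate.

    Assumes CPI unpipelined = pipeline depth and no pipelining overhead.
    """
    cpi_pipelined = 1 + stall_cycles_per_instruction
    return pipeline_depth / cpi_pipelined


# Hypothetical numbers: a 5-stage pipeline with and without stalls.
print(pipeline_speedup(5, 0.0))   # 5.0
print(pipeline_speedup(5, 0.25))  # 4.0
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction eats directly into it.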
Structural Hazards
At some stage of the pipeline, two instructions require the same resource.
A shared single-memory port for data and instructions
- instruction memory is always used in the first stage of the pipeline for the instruction fetch
- a load (or store) instruction will access the data memory in the 4th stage (MEM)
- with a single shared memory port for data and instructions, we cannot fetch an instruction and access data memory in the same clock cycle.
- Figure C.4 shows a load instruction followed by 4 non-memory instructions.
- Here is a timing diagram showing the stall (like figure C.5):
This assumes none of the other instructions are loads or stores so they do not need to access memory in the MEM stage.
| clock number |
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
Load instruction | IF | ID | EX | MEM | WB | |
Instruction i+1 | | IF | ID | EX | MEM | WB | |
Instruction i+2 | | | IF | ID | EX | MEM | WB | |
Instruction i+3 | | | | stall | IF | ID | EX | MEM | WB | |
Instruction i+4 | | | | | | IF | ID | EX | MEM | WB | |
Instruction i+5 | | | | | | | IF | ID | EX | MEM | WB | |
Instruction i+6 | | | | | | | | IF | ID | EX | MEM | WB | |
Examples:
ClassQue: Memory structural hazard
- Compare the corresponding balanced unpipelined machine to a 5-stage pipelined machine with a single shared memory port, in which loads and stores together make up 30% of the instructions.
- Since most computers use the same memory for data and instructions, why is the above not a problem for modern machines?
Data Hazards
These occur when the pipeline would change the order of read/write accesses so that they differ from the order of unpipelined execution.
Consider:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
- Recall that in each of these instructions, the first register is the destination of the operation.
- We assume that in ID, register reads occur at the end of the cycle, and in WB, the register writes occur at the start of the cycle.
We will see how this eliminates one of the hazards.
- Each instruction after the first uses R1.
- Figure C.6 shows the execution of these instructions in the standard pipeline.
- The DSUB instruction reads R1 at the end of CC 3, but the new value is not written until the beginning of CC 5.
- Similarly, the AND needs it at the end of CC 4.
- The OR needs it at the end of CC 5, which is OK because the write occurs at the start of CC 5.
- The XOR doesn't need it until CC 6 so it is fine.
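The cycle counting in the list above can be reproduced with a short Python sketch. It assumes a pipeline with no stalls and no forwarding, in which the i-th instruction (counting from 0) is in IF during CC i+1, reads its registers at the end of ID (CC i+2), and writes its result at the start of WB (CC i+5). The tuple layout and the name `stale_reads` are illustrative choices, not part of the notes.

```python
# Each entry: (opcode, destination register, source registers)
program = [
    ("DADD", "R1", ["R2", "R3"]),
    ("DSUB", "R4", ["R1", "R5"]),
    ("AND", "R6", ["R1", "R7"]),
    ("OR", "R8", ["R1", "R9"]),
    ("XOR", "R10", ["R1", "R11"]),
]


def stale_reads(program):
    """List (consumer, register, read cycle, write cycle) for every stale read."""
    hazards = []
    for i, (op, _, sources) in enumerate(program):
        id_cycle = i + 2  # registers are read at the end of this cycle
        for j in range(i):
            _, dest, _ = program[j]
            wb_cycle = j + 5  # register is written at the start of this cycle
            # A read in the same cycle as the write sees the new value.
            if dest in sources and id_cycle < wb_cycle:
                hazards.append((op, dest, id_cycle, wb_cycle))
    return hazards


for hazard in stale_reads(program):
    print(hazard)
# ('DSUB', 'R1', 3, 5)
# ('AND', 'R1', 4, 5)
```

Only the DSUB and the AND read R1 too early; the OR reads in the same cycle as the write, and the XOR reads one cycle later.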
Today's News: September 17
No news yet.
Forwarding
The idea of forwarding is that even though a result is not stored in the register file until WB, it is often available several cycles earlier. For an ALU instruction it is available in the EX/MEM pipeline register.
Figure C.7 shows how the two stalls from the previous example can be eliminated by forwarding (part of the) contents of the pipeline registers to the next stage.
Problem
ClassQue: forwarding hardware
To implement forwarding for the first two instructions in the example above, one of the ALU inputs
must be selectable from two different places, depending on the previous instruction.
- From which pipeline register(s) does the ALU get its inputs?
- What type of circuit is required to implement this?
Examples:
Consider:
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
The LD and SD use the ALU to calculate the effective address in EX.
Figure C.8 shows how forwarding can be used to get R1 before it is stored back in the register file.
Also, the value of R4 from the LD is given to the SD before it goes into the register file.
Sometimes stalls are necessary
Consider:
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Figure C.9 shows the required forwarding paths.
The DSUB needs the result of the LD before it is available anywhere, so a stall is required.
The EX cycle of the DSUB requires the value generated in the MEM cycle of the LD, which occurs at the same time.
This is fixed by introducing a stall before the EX cycle of the DSUB.
All subsequent instructions are also stalled.
| clock number |
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
LD R1,0(R2) | IF | ID | EX | MEM | WB | |
DSUB R4,R1,R5 | | IF | ID | stall | EX | MEM | WB | |
AND R6,R1,R7 | | | IF | stall | ID | EX | MEM | WB | |
OR R8,R1,R9 | | | | stall | IF | ID | EX | MEM | WB | |
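The forwarding cases discussed so far can be summarized in a small lookup, sketched here in Python. It assumes the standard 5-stage pipeline with forwarding into the ALU inputs, register writes at the start of WB and reads at the end of ID. Each consumer is considered in isolation, so a stall already inserted for an earlier instruction is ignored. The function name `operand_source` and the string labels are hypothetical.

```python
def operand_source(producer_kind, distance):
    """Return (stall_cycles, source) for an operand needed in EX.

    producer_kind: "alu" (result ready at the end of EX) or
                   "load" (result ready at the end of MEM).
    distance: how many instructions after the producer the consumer is (1 = next).
    """
    if producer_kind not in ("alu", "load"):
        raise ValueError("producer_kind must be 'alu' or 'load'")
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if distance >= 3:
        # The producer has written the register file by the consumer's ID read.
        return 0, "register file"
    if distance == 2:
        return 0, "MEM/WB"
    if producer_kind == "alu":
        return 0, "EX/MEM"
    # Load followed immediately by a use: the data does not exist yet,
    # so stall one cycle and then forward from MEM/WB.
    return 1, "MEM/WB"


print(operand_source("alu", 1))   # (0, 'EX/MEM')
print(operand_source("load", 1))  # (1, 'MEM/WB')
print(operand_source("load", 3))  # (0, 'register file')
```

The only case that needs a stall is a load whose result is used by the very next instruction, which is the LD and DSUB pair in the table above.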
Question:
ClassQue: stalls after memory access
- Suppose the sequence of instructions is:
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R7, R7
OR R8, R7, R7
- Would we still have to delay the AND and OR instructions, even though they use different registers? Why?
- How could you prevent the stalls in this code sequence?
Answer:
?
Branch Hazards
Control hazards can cause a significant performance loss.
- If a branch changes the PC to its target address, we say the branch is taken.
- Otherwise, the branch is not taken (untaken).
- If a branch is taken, the PC is modified at the end of ID.
- At this point the next instruction has already been fetched and needs to be discarded.
- One way to do this is to always redo the fetch of the instruction after a branch, as shown below:
Branch instruction | IF | ID | EX | MEM | WB | |
Branch successor | | IF | IF | ID | EX | MEM | WB | |
Branch successor + 1 | | | | IF | ID | EX | MEM | WB | |
Branch successor + 2 | | | | | IF | ID | EX | MEM | WB | |
Four static methods of dealing with branch stalls
Method 1: freeze or flush the pipeline
This is the method that was shown above.
The penalty is always one cycle and cannot be reduced by software.
Method 2: treat every branch as not taken
- In general, we would need to back out of any action that occurred once we find out that the branch was taken.
- In our simple 5-stage pipeline, turn the next instruction into a no-op.
- This works because we can tell if the branch is taken during ID.
- This is illustrated below:
Untaken branch instruction | IF | ID | EX | MEM | WB | |
Instruction i + 1 | | IF | ID | EX | MEM | WB | |
Instruction i + 2 | | | IF | ID | EX | MEM | WB | |
Instruction i + 3 | | | | IF | ID | EX | MEM | WB | |
|
Taken branch instruction | IF | ID | EX | MEM | WB | |
Instruction i + 1 | | IF | idle | idle | idle | idle | |
Branch target | | | IF | ID | EX | MEM | WB | |
Branch target + 1 | | | | IF | ID | EX | MEM | WB | |
Branch target + 2 | | | | | IF | ID | EX | MEM | WB | |
- While Method 1 always causes a stall for each branch, this only causes a stall if the branch is taken.
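To compare Methods 1 and 2 numerically, here is a minimal Python sketch. It assumes branches are the only source of stalls and that the penalty is one cycle, as in the 5-stage pipeline above. The function name, the method labels, and the 20% branch frequency and 60% taken rate are hypothetical values chosen for illustration.

```python
def cpi_with_branches(branch_fraction, taken_fraction, method):
    """Average CPI when branches are the only source of stalls (1-cycle penalty)."""
    if method == "flush":
        # Method 1: every branch costs one cycle.
        stalls = branch_fraction * 1
    elif method == "predict_not_taken":
        # Method 2: only taken branches cost one cycle.
        stalls = branch_fraction * taken_fraction * 1
    else:
        raise ValueError(f"unknown method: {method}")
    return 1 + stalls


# Hypothetical workload: 20% branches, 60% of them taken.
for method in ("flush", "predict_not_taken"):
    print(method, cpi_with_branches(0.20, 0.60, method))
```

Under these assumed numbers Method 2 lowers the stall contribution from 0.20 to about 0.12 cycles per instruction.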
Today's News: September 19
Pick up your Assignment 1 if you handed it in on Tuesday.
Method 3: treat every branch as taken
- Not useful in the 5-stage pipeline, since the branch target is not known until the end of ID, which is no earlier than the branch outcome.
- For longer pipelines, this method gives the smallest penalty when the branch is taken.
Method 4: The delayed branch
- Must be a feature of the ISA, and therefore the programmer (or compiler writer) must take this into account.
- The instruction after a branch is always executed, whether the branch is taken or not.
- This allows us to know the branch target and whether the branch is taken before we fetch the next instruction.
- There are no branch stalls with this method as long as a useful instruction can be put in the delay slot.
- It is the job of the compiler to schedule a useful instruction into the delay slot.
- If none can be found, the delay slot can be filled with a no-op.
- Figure C.14 shows several methods of scheduling the delay slot.
- The best method of scheduling the delay slot might depend on whether the branch is taken or not.
Reducing the branch cost through prediction
There are 2 classes of branch prediction:
- static prediction: low cost - can be used by compilers
- dynamic prediction: based on program behavior
Static Branch Prediction
- Use profiling to predict which branches are usually taken and which ones are usually not taken.
- Figure C.17 shows the success of this strategy for some SPEC benchmarks.
- In SPEC, branches make up between 3% and 24% of all instructions executed.
Dynamic Branch Prediction
The simplest technique uses a branch prediction buffer or branch history table.
- branch prediction buffer: small memory indexed by the low bits of the address of the branch instruction.
- simple: each entry has a bit indicating whether the last branch at that address was taken or untaken.
- better: each entry has 2 bits so that a prediction must miss twice before it is changed.
- Figure C.18 shows how this scheme works.
- Figure C.19 shows the accuracy of a 2-bit predictor on some SPEC benchmarks.
- Figure C.20 shows that adding more entries to the table has little effect.
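A branch prediction buffer with 2-bit entries can be sketched in Python as follows. This version uses saturating counters, which is one common 2-bit scheme and may differ in detail from the state diagram in Figure C.18: from a saturated state the prediction flips only after two misses, but from a middle state a single miss is enough. The class name, the 16-entry table size, the 4-byte instruction assumption, and the branch address are all hypothetical.

```python
class TwoBitPredictor:
    """Branch prediction buffer of 2-bit saturating counters (a sketch)."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
        self.counters = [0] * (1 << index_bits)

    def _index(self, branch_address):
        # Assumes 4-byte instructions, so the two lowest address bits are dropped.
        return (branch_address >> 2) & self.mask

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, branch_address, outcomes):
    """Run the predictor over a list of actual outcomes for one branch."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_address) != taken:
            misses += 1
        predictor.update(branch_address, taken)
    return misses


# A loop branch taken 9 times and then not taken, executed twice.
outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(TwoBitPredictor(), 0x400, outcomes))  # 4
```

After the counter warms up, the loop branch is mispredicted only once per pass, at the loop exit.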