CS 3853 Architecture Notes on Appendix C Section 2

Read Appendix C

C2: Pipeline Hazards

A hazard prevents the next instruction from executing during its designated clock cycle.

Hazard Classifications

structural hazard: the hardware cannot support every combination of instructions that overlapped execution brings together.
data hazard: instruction needs data from a previous instruction before it is available.
control hazard: branch changes the PC after a later instruction has been fetched.

A hazard may require that the pipeline stalls until the hazard can be cleared.
For now, when a stall occurs:

all instructions issued later will also stall
all instructions issued earlier will continue so that the hazard can be cleared

Performance with stalls

We will compare an unpipelined machine in which instructions take several cycles to a pipelined machine with the same clock rate.

$$\text{speedup} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}}$$

With no hazards, CPI pipelined = 1.
With hazards, CPI pipelined = 1 + stall cycles per instruction, so

$$\text{speedup} = \frac{\text{CPI unpipelined}}{1 + \text{stall cycles per instruction}}$$

In the case in which all instructions on the unpipelined machine take the same time and the pipeline is completely balanced with no overhead, CPI unpipelined = pipeline depth and

$$\text{speedup} = \frac{\text{pipeline depth}}{1 + \text{stall cycles per instruction}}$$
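
The last formula can be tried out numerically. The sketch below is only an illustration: the function name `pipeline_speedup` and the sample stall rates are made up for this example, and it assumes the balanced, no-overhead case described above.

```python
def pipeline_speedup(pipeline_depth: int, stall_cycles_per_instruction: float) -> float:
    """Speedup over a balanced unpipelined machine with the same clock rate.

    Assumes the ideal pipelined CPI is 1 and the unpipelined CPI equals
    the pipeline depth (the special case described in the notes).
    """
    if pipeline_depth < 1 or stall_cycles_per_instruction < 0:
        raise ValueError("depth must be >= 1 and stalls must be non-negative")
    cpi_pipelined = 1.0 + stall_cycles_per_instruction
    return pipeline_depth / cpi_pipelined


if __name__ == "__main__":
    # Hypothetical stall rates, chosen only for illustration.
    for stalls in (0.0, 0.25, 1.0):
        print(f"depth=5, stalls/instr={stalls}: speedup={pipeline_speedup(5, stalls):.2f}")
```

With one stall cycle per instruction the 5-stage pipeline is only 2.5 times faster than the unpipelined machine, half of its ideal speedup.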

Structural Hazards

At some stage of the pipeline, two instructions require the same resource.
A shared single-memory port for data and instructions

instruction memory is always used in the first stage of the pipeline for the instruction fetch
a load (or store) instruction will access the data memory in the 4th stage (MEM)
with a single memory port shared by data and instructions, an instruction fetch and a data access cannot occur in the same clock cycle.
Figure C.4 shows a load instruction followed by 4 non-memory instructions.

Here is a timing diagram showing the stall (like figure C.5):
This assumes none of the other instructions are loads or stores so they do not need to access memory in the MEM stage.

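	clock number
Instruction	1	2	3	4	5	6	7	8	9	10	11	12
Load instruction	IF	ID	EX	MEM	WB
Instruction i+1		IF	ID	EX	MEM	WB
Instruction i+2			IF	ID	EX	MEM	WB
Instruction i+3				stall	IF	ID	EX	MEM	WB
Instruction i+4						IF	ID	EX	MEM	WB
Instruction i+5							IF	ID	EX	MEM	WB
Instruction i+6								IF	ID	EX	MEM	WB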
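
The stall in the timing diagram above can be reproduced with a small simulation. This is a minimal sketch, not the textbook's model: the function name `fetch_cycles` is hypothetical, and it assumes a 5-stage pipeline with one memory port shared by IF and MEM and no other hazards.

```python
def fetch_cycles(is_memory_op):
    """Return the clock cycle in which each instruction is fetched.

    Assumptions (illustrative only): stages IF ID EX MEM WB, one memory
    port shared by IF and MEM, and no other hazards. An instruction
    fetched in cycle c reaches MEM in cycle c + 3.
    """
    port_busy = set()   # cycles in which a load/store uses the port in MEM
    cycles = []
    next_cycle = 1
    for uses_memory in is_memory_op:
        # Stall the fetch while an earlier load/store is using the port.
        while next_cycle in port_busy:
            next_cycle += 1
        cycles.append(next_cycle)
        if uses_memory:
            port_busy.add(next_cycle + 3)
        next_cycle += 1
    return cycles


if __name__ == "__main__":
    # One load followed by six instructions that do not access data memory.
    program = [True] + [False] * 6
    print(fetch_cycles(program))
```

For this program the fetch of instruction i+3 moves from cycle 4 to cycle 5, because the load occupies the memory port in cycle 4.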

Examples:

ClassQue: Memory structural hazard

Compare the corresponding balanced unpipelined machine to a 5-stage pipelined machine with a single shared memory port, for a workload in which loads and stores together make up 30% of the instructions.
Since most computers use the same memory for data and instructions, why is the above not a problem for modern machines?
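
A rough estimate for the first question can be worked out from the speedup formula. It rests on assumptions that the question does not state: every load or store costs exactly one stall cycle, there are no other hazards, and the unpipelined CPI equals the pipeline depth of 5.

$$\text{CPI pipelined} = 1 + 0.30 \times 1 = 1.30, \qquad \text{speedup} = \frac{5}{1.30} \approx 3.85$$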

Data Hazards

These occur when the pipeline would change the order of read/write accesses so that they differ from the order of unpipelined execution.
Consider:

DADD    R1, R2, R3
DSUB    R4, R1, R5
AND     R6, R1, R7
OR      R8, R1, R9
XOR     R10, R1, R11

Recall that in each of these instructions, the first register is the destination of the operation.
We assume that in ID, register reads occur at the end of the cycle, and in WB, the register writes occur at the start of the cycle.
We will see how this eliminates one of the hazards.
Each instruction after the first uses R1.
Figure C.6 shows the execution of these instructions in the standard pipeline.
The DSUB instruction needs the new value of R1 at the end of CC 3, but it is not written to the register file until the beginning of CC 5.
Similarly, the AND needs it at the end of CC 4.
The OR needs it at the end of CC 5, which is OK because the write occurs at the start of CC 5.
The XOR doesn't need it until CC 6 so it is fine.
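
The cycle-by-cycle reasoning above boils down to a distance check between a producer and its consumers. The sketch below is an illustration under stated assumptions: one instruction enters the pipeline per cycle, registers are read at the end of ID and written at the start of WB, and there is no forwarding. The function name and the tuple encoding of instructions are made up for this example.

```python
def raw_hazards_without_forwarding(program):
    """List (producer, consumer) index pairs that would read a stale register.

    Assumptions (illustrative only): one instruction issues per cycle,
    reads happen at the end of ID, writes at the start of WB, and there
    is no forwarding. A consumer issued 1 or 2 instructions after its
    producer then reads before the write has happened.
    """
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        for i in range(max(0, j - 2), j):
            dest = program[i][1]
            if dest in sources:
                hazards.append((i, j))
    return hazards


if __name__ == "__main__":
    # (opcode, destination, sources) for the sequence in the notes.
    program = [
        ("DADD", "R1", ("R2", "R3")),
        ("DSUB", "R4", ("R1", "R5")),
        ("AND", "R6", ("R1", "R7")),
        ("OR", "R8", ("R1", "R9")),
        ("XOR", "R10", ("R1", "R11")),
    ]
    for i, j in raw_hazards_without_forwarding(program):
        print(f"{program[j][0]} reads {program[i][1]} too early")
```

Only the DSUB and the AND are flagged; the OR and XOR are far enough behind the DADD to read the updated register.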

Today's News: September 17

No news yet.

Forwarding
The idea of forwarding is that even though a result is not stored in the register file until WB, it is often available several cycles earlier. For an ALU instruction it is available in the EX/MEM pipeline register.
Figure C.7 shows how the two stalls from the previous example can be eliminated by forwarding (part of the) contents of the pipeline registers to the next stage.
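
The selection of an ALU operand under forwarding can be sketched in software. This is a simplified model, not the actual hardware: the `Latch` type and `alu_operand` function are hypothetical names, and it assumes results can be taken from both the EX/MEM and MEM/WB pipeline registers, with the newer one taking priority.

```python
from typing import NamedTuple, Optional


class Latch(NamedTuple):
    """Hypothetical view of a pipeline register: a destination and its value."""
    dest: Optional[str]
    value: int


def alu_operand(src, regfile, ex_mem, mem_wb):
    """Pick the newest available value for source register `src`."""
    if ex_mem.dest == src:      # result of the instruction one ahead
        return ex_mem.value
    if mem_wb.dest == src:      # result of the instruction two ahead
        return mem_wb.value
    return regfile[src]         # no pending write: use the register file


if __name__ == "__main__":
    regfile = {"R1": 0, "R5": 7}          # R1 still holds the stale value
    ex_mem = Latch(dest="R1", value=30)   # DADD result, not yet written back
    mem_wb = Latch(dest=None, value=0)
    print(alu_operand("R1", regfile, ex_mem, mem_wb))
    print(alu_operand("R5", regfile, ex_mem, mem_wb))
```

The register file is consulted only when neither pipeline register holds a pending write to the requested register.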
Problem

ClassQue: forwarding hardware

To implement forwarding for the first two instructions in the example above, one of the ALU inputs must be selectable from two different sources, depending on the previous instruction.

From which pipeline register(s) does the ALU get its inputs?
What type of circuit is required to implement this?

Examples:
Consider:

DADD    R1, R2, R3
LD      R4, 0(R1)
SD      R4, 12(R1)

The LD and SD use the ALU to calculate the effective address in EX.
Figure C.8 shows how forwarding can be used to get R1 before it is stored back in the register file.
Also, the value of R4 from the LD is given to the SD before it goes into the register file.

Sometimes stalls are necessary
Consider:

LD      R1, 0(R2)
DSUB    R4, R1, R5
AND     R6, R1, R7
OR      R8, R1, R9

Figure C.9 shows the required forwarding paths.
The DSUB needs the result of the LD before it is available anywhere, so a stall is required.
The EX cycle of the DSUB requires the value generated in the MEM cycle of the LD which occurs at the same time.
This is fixed by introducing a stall before the EX cycle of the DSUB.
All subsequent instructions are also stalled.

	clock number
Instruction	1	2	3	4	5	6	7	8	9
`LD R1,0(R2)`	IF	ID	EX	MEM	WB
`DSUB R4,R1,R5`		IF	ID	stall	EX	MEM	WB
`AND R6,R1,R7`			IF	stall	ID	EX	MEM	WB
`OR R8,R1,R9`				stall	IF	ID	EX	MEM	WB
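
The total of 9 cycles in the table can be checked with a short calculation. The sketch below is illustrative only: the function names and the tuple encoding are made up, and it assumes full forwarding, so that the only remaining data stall is a load followed directly by an instruction that reads the loaded register in EX.

```python
def load_use_stalls(program):
    """Count stall cycles caused by using a loaded value immediately.

    Assumptions (illustrative only): full forwarding, so the only data
    hazard that still stalls is a load followed directly by an
    instruction listing the loaded register among its sources; each
    such pair costs one cycle.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LD" and prev_dest in curr_sources:
            stalls += 1
    return stalls


def total_cycles(program, depth=5):
    """Cycles to finish the whole sequence, including load-use stalls."""
    return len(program) + depth - 1 + load_use_stalls(program)


if __name__ == "__main__":
    # (opcode, destination, sources) for the sequence in the table.
    program = [
        ("LD", "R1", ("R2",)),
        ("DSUB", "R4", ("R1", "R5")),
        ("AND", "R6", ("R1", "R7")),
        ("OR", "R8", ("R1", "R9")),
    ]
    print(load_use_stalls(program), total_cycles(program))
```

Four instructions in a 5-stage pipeline would finish in 8 cycles; the single load-use stall pushes the last WB to cycle 9.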

Question:

ClassQue: stalls after memory access

Suppose the sequence of instructions is:
```
LD      R1, 0(R2)
DSUB    R4, R1, R5
AND     R6, R7, R7
OR      R8, R7, R7
```
1. Would we still have to delay the AND and OR instructions, even though they use different registers? Why?
2. How could you prevent the stalls in this code sequence?
Answer:
?

Branch Hazards

Control hazards can cause a significant performance loss.

If a branch changes the PC to its target address, we say the branch is taken.
Otherwise, it is not taken or untaken.
If a branch is taken, the PC is modified at the end of ID.
At this point the next instruction has already been fetched and needs to be discarded.
One way to do this is to always redo the fetch of the instruction that follows a branch, as shown below:

Branch instruction	IF	ID	EX	MEM	WB
Branch successor		IF	IF	ID	EX	MEM	WB
Branch successor + 1				IF	ID	EX	MEM	WB
Branch successor + 2					IF	ID	EX	MEM	WB

Four static methods of dealing with branch stalls
Method 1: freeze or flush the pipeline
This is the method that was shown above.
The penalty is always one cycle and cannot be reduced by software.

Method 2: treat every branch as not taken

In general we would need to back out of any action that occurred once we find out that the branch was taken.
In our simple 5-stage pipeline, it is enough to turn the next instruction into a no-op.
This works because we can tell if the branch is taken during ID.

This is illustrated below:

Untaken branch instruction	IF	ID	EX	MEM	WB
Instruction i + 1		IF	ID	EX	MEM	WB
Instruction i + 2			IF	ID	EX	MEM	WB
Instruction i + 3				IF	ID	EX	MEM	WB

Taken branch instruction	IF	ID	EX	MEM	WB
Instruction i + 1		IF	idle	idle	idle	idle
Branch target			IF	ID	EX	MEM	WB
Branch target + 1				IF	ID	EX	MEM	WB
Branch target + 2					IF	ID	EX	MEM	WB

While Method 1 always causes a stall for each branch, this only causes a stall if the branch is taken.
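
The difference between the two methods can be expressed as a CPI estimate. This is a minimal sketch under the one-cycle penalty of the simple 5-stage pipeline; the function names and the sample branch and taken fractions are hypothetical, not measurements.

```python
def cpi_flush(branch_fraction, penalty=1):
    """Method 1: every branch pays the penalty."""
    return 1 + branch_fraction * penalty


def cpi_predict_not_taken(branch_fraction, taken_fraction, penalty=1):
    """Method 2: only taken branches pay the penalty."""
    return 1 + branch_fraction * taken_fraction * penalty


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 60% of them taken.
    print(f"flush:             {cpi_flush(0.20):.2f}")
    print(f"predict not taken: {cpi_predict_not_taken(0.20, 0.60):.2f}")
```

Both estimates ignore data and structural stalls, so they isolate the cost of branches alone.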

Today's News: September 19

Pick up your Assignment 1 if you handed it in on Tuesday.

Method 3: treat every branch as taken

Not useful in the 5-stage pipeline since we do not know the branch target until after ID, which is too late.
For longer pipelines, this method gives the smallest penalty when the branch is taken.

Method 4: The delayed branch

This must be a feature of the ISA, and therefore the programmer (or compiler writer) must take it into account.

The instruction after a branch is always executed, whether the branch is taken or not.

This allows us to know the branch target and whether the branch is taken before we fetch the next instruction.

There are no branch stalls with this method as long as a useful instruction can be put in the delay slot.

It is the job of the compiler to schedule a useful instruction into the delay slot.

If none can be found, the delay slot can be filled with a no-op.

Figure C.14 shows several methods of scheduling the delay slot.
The best method of scheduling the delay slot might depend on whether the branch is taken or not.

Reducing the branch cost through prediction

There are 2 classes of branch prediction:

static prediction: low cost - can be used by compilers
dynamic prediction: based on program behavior

Static Branch Prediction

Use profiling to predict which branches are usually taken and which ones are usually not taken.
Figure C.17 shows the success of this strategy for some SPEC benchmarks.
In SPEC, branches make up between 3% and 24% of all instructions executed.

Dynamic Branch Prediction

The simplest technique uses a branch prediction buffer or branch history table.

branch prediction buffer: small memory indexed by the low bits of the address of the branch instruction.
simple: each entry has a bit indicating whether the last branch at that address was taken or untaken.
better: each entry has 2 bits so that a prediction must miss twice before it is changed.
Figure C.18 shows how this scheme works.
Figure C.19 shows the accuracy of a 2-bit predictor on some SPEC benchmarks.
Figure C.20 shows that adding more entries to the table has little effect.
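
A branch prediction buffer with 2-bit entries can be sketched as a table of saturating counters. This is one common realization, and the state diagram in Figure C.18 may differ in detail. The class name, the table size of 16 entries, and the initial counter value are assumptions made for this example.

```python
class TwoBitPredictor:
    """Branch prediction buffer of 2-bit saturating counters.

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    The table size and the initial counter value are assumptions.
    """

    def __init__(self, entries=16, initial=1):
        self.entries = entries
        self.table = [initial] * entries

    def _index(self, address):
        # Low bits of the branch address select the entry.
        return address % self.entries

    def predict(self, address):
        return self.table[self._index(address)] >= 2

    def update(self, address, taken):
        i = self._index(address)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, address, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(address) != taken:
            misses += 1
        predictor.update(address, taken)
    return misses


if __name__ == "__main__":
    # Hypothetical loop branch: taken 9 times, then not taken, 3 times over.
    outcomes = ([True] * 9 + [False]) * 3
    print(count_mispredictions(TwoBitPredictor(), 0x40, outcomes))
```

Once the counter has saturated at "taken", the single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction for the next pass through the loop.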
