CS 3853 Architecture Notes on Appendix C Section 3

IF
- IR ← Mem[PC]
- NPC ← PC + 4
ID
- A ← Regs[rs]
- B ← Regs[rt]
- Imm ← sign-extended field of IR
EX
- Load or Store:
  ALUOutput ← A + Imm
- RR ALU:
  ALUOutput ← A funct B
- R-Imm ALU
  ALUOutput ← A op Imm
- Branch:
  ALUOutput ← NPC + (Imm << 2)
  Cond ← (A == 0)

ClassQue: Figure C.21 Muxes A and B

MEM
- if Branch and Cond: PC ← ALUOutput
  otherwise: PC ← PC + 4
- Load:
  LMD ← Mem[ALUOutput]
- Store:
  Mem[ALUOutput] ← B
WB
- Load:
  Regs[rt] ← LMD
- RR ALU:
  Regs[rd] ← ALUOutput
- R-Imm ALU:
  Regs[rt] ← ALUOutput
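
The five steps above can be sketched as one Python function. This is only an illustration, not the book's implementation: the function name `execute`, the dictionary that stands in for a decoded IR, and the `kind` labels are all assumptions made for this sketch.

```python
import operator

def execute(instr, regs, mem, pc):
    """Run one instruction through IF, ID, EX, MEM, WB and return the new PC.

    `instr` is a hypothetical pre-decoded stand-in for IR, for example
    {"kind": "rr", "rs": 1, "rt": 2, "rd": 3, "funct": operator.add}.
    """
    # IF: the fetch itself is assumed done; NPC <- PC + 4
    npc = pc + 4
    # ID: A, B and Imm are always read, whether or not they are needed
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    imm = instr.get("imm", 0)
    kind = instr["kind"]
    # EX
    cond = False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "rr":
        alu_output = instr["funct"](a, b)
    elif kind == "ri":
        alu_output = instr["op"](a, imm)
    elif kind == "branch":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")
    # MEM: update the PC, then do the memory access (if any)
    new_pc = alu_output if (kind == "branch" and cond) else npc
    lmd = None
    if kind == "load":
        lmd = mem[alu_output]
    elif kind == "store":
        mem[alu_output] = b
    # WB
    if kind == "load":
        regs[instr["rt"]] = lmd
    elif kind == "rr":
        regs[instr["rd"]] = alu_output
    elif kind == "ri":
        regs[instr["rt"]] = alu_output
    return new_pc
```

Note that the destination register is `rd` for RR ALU instructions but `rt` for loads and R-Imm ALU instructions, exactly as in the WB list above.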

Question:

ClassQue: Figure C.21 Instruction Encoding

The RR instruction is described as:

RR ALU: Regs[rd] ← Regs[rs] funct Regs[rt]

What would have to change if instead it were:

RR ALU: Regs[rs] ← Regs[rt] funct Regs[rd]

Answer:

Pipelined Implementation

Figure C.22 shows a corresponding pipeline implementation.
The registers NPC, IR, A, B, Imm, Cond, ALUOutput and LMD are now contained in the pipeline registers.
Examples:

NPC is contained in which pipeline register?
Answer:
NPC is created in IF, so it is stored in the IF/ID register.
It is needed in EX, so it must also be in ID/EX.
IR is stored in which pipeline registers?
Answer:
Parts of the IR register are needed in each cycle, so for simplicity, the entire IR is propagated to each pipeline register. This is somewhat inefficient.

Today's News: September 26

Exam 1 will be one week from today.

Examples: Figure C.23 shows the details of the pipelined execution for each type of instruction.

Below is a comparison for the RR ALU instruction. See Figures C.21 and C.22
Operations that are performed, but not needed for this instruction are shown this way: operation.

Stage	Unpipelined	Pipelined
IF	IR ← Mem[PC]; NPC ← PC + 4	IF/ID.IR ← Mem[PC]; PC ← PC + 4; IF/ID.NPC ← PC + 4
ID	A ← Regs[IR.rs]; B ← Regs[IR.rt]; Imm ← sign-extended(IR.Immediate)	ID/EX.A ← Regs[IF/ID.IR.rs]; ID/EX.B ← Regs[IF/ID.IR.rt]; ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← sign-extended(IF/ID.IR.Immediate)
EX	ALUOutput ← A funct B	EX/MEM.IR ← ID/EX.IR; EX/MEM.ALUOutput ← ID/EX.A funct ID/EX.B
MEM	PC ← PC + 4	MEM/WB.IR ← EX/MEM.IR; MEM/WB.ALUOutput ← EX/MEM.ALUOutput
WB	Regs[IR.rd] ← ALUOutput	Regs[MEM/WB.IR.rd] ← MEM/WB.ALUOutput

Note: My notation is slightly different from that of the book.

For the unpipelined case I use IR.rs instead of just rs, etc.
For the pipelined case I use XX/XX.IR.rs instead of XX/XX.IR[rs]
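
To make the pipelined column concrete, here is a minimal Python sketch that carries a single RR ALU instruction through the four pipeline registers. The dictionaries `if_id`, `id_ex`, `ex_mem`, `mem_wb` and the function `run_rr_alu` are hypothetical names chosen for this example; the decoded IR is assumed to be a dictionary with `rs`, `rt`, `rd` and `imm` fields.

```python
def run_rr_alu(instr_mem, regs, pc, funct):
    """Walk one RR ALU instruction through IF/ID, ID/EX, EX/MEM, MEM/WB.

    Returns the updated PC. Each stage reads only the pipeline register
    written by the stage before it.
    """
    # IF: fetch, save NPC, bump the PC
    if_id = {"IR": instr_mem[pc], "NPC": pc + 4}
    pc = pc + 4
    # ID: read registers; NPC and Imm are carried along although an
    # RR ALU instruction never uses them
    id_ex = {
        "A": regs[if_id["IR"]["rs"]],
        "B": regs[if_id["IR"]["rt"]],
        "NPC": if_id["NPC"],
        "IR": if_id["IR"],
        "Imm": if_id["IR"]["imm"],
    }
    # EX: the ALU operates on ID/EX.A and ID/EX.B
    ex_mem = {"IR": id_ex["IR"], "ALUOutput": funct(id_ex["A"], id_ex["B"])}
    # MEM: nothing to do for RR ALU except pass the values along
    mem_wb = {"IR": ex_mem["IR"], "ALUOutput": ex_mem["ALUOutput"]}
    # WB: the destination comes from the IR copy that reached MEM/WB
    regs[mem_wb["IR"]["rd"]] = mem_wb["ALUOutput"]
    return pc
```

The sketch shows why IR has to travel with the instruction: by the time WB runs, the IR in IF/ID already belongs to a later instruction, so `rd` must be read from MEM/WB.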

ClassQue: Exam 1 Sample Problems 1

ClassQue: Exam 1 Sample Problems 2


How Branches Work

Branches are hard.

We already know that branches can cause stalls.
The problem is that we might not know the branch address or if the branch is taken until one or more additional instructions have been fetched, and possibly executed.
We are saved by the fact that the external state (what programs see) is not changed until MEM or WB.

In the unpipelined architecture shown in Figure C.21:

the NPC stores the potential new PC during IF
the branch address and whether the branch is taken is computed in EX
the PC is updated in MEM
for a branch, the instruction is complete after the MEM cycle.

The pipelined architecture shown in Figure C.22 has a 3-cycle stall when a branch is taken.
Suppose the instruction stream looks like:

instruction (not branch)
instruction (not branch)
instruction (not branch)
instruction A: taken branch
instruction B
instruction C
instruction D
...
instruction X: branch target

At the end of IF, the PC is set either to PC+4 (the normal case) or, if the Zero? field of EX/MEM is not 0, to the ALU result.
The Zero? field of EX/MEM stays 0 until the branch instruction is executed.
If the branch instruction is fetched in cycle i:

cycle i:
- IF of A: taken branch is fetched
- IF of A: branch instruction stored in IF/ID
- IF of A: PC + 4 stored in PC (address of instruction i+1)
cycle i+1:
- IF of B: instruction B at i+1 is fetched
- IF of B: PC + 4 is stored in PC (address of instruction i+2)
- ID of A: branch base register stored in ID/EX
- ID of A: branch destination offset is stored in ID/EX
- ID of A: branch instruction is stored in ID/EX (from IF/ID)
cycle i+2:
- IF of C: instruction C at i+2 is fetched
- IF of C: PC + 4 is stored in PC (address of instruction i+3)
- ID of B: instruction B at i+1 is decoded
- EX of A: branch instruction Zero? stored in EX/MEM (this is 1 since the branch is taken)
- EX of A: branch destination stored in EX/MEM
cycle i+3:
- IF of D: instruction D at i+3 is fetched
- IF of D: branch destination is stored in PC (since Zero? field of EX/MEM is now set)
- ID of C: instruction C at i+2 is decoded
- EX of B: instruction B at i+1 is executed
  Note that even if this is a branch, we do not want to set Zero?
- MEM of A: nothing (for branch instruction)
cycle i+4:
- IF: branch destination is fetched
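
The cycle-by-cycle walk-through above can be reproduced with a small Python simulation of the fetch logic. This is a sketch under simplifying assumptions: `fetch_trace` is a hypothetical name, every branch in `program` is treated as taken, and only the PC mux and the Zero?/target fields of EX/MEM are modeled.

```python
def fetch_trace(program, start_pc, n_cycles):
    """Return the names of the instructions fetched in each cycle.

    `program` maps a byte address to (name, target); target is None for a
    non-branch and the destination address for a taken branch. It must
    contain every address that gets fetched.
    """
    pc = start_pc
    if_id = id_ex = None                      # instruction held in each register
    ex_mem = {"cond": 0, "target": None}      # Zero? and branch destination
    fetched = []
    for _ in range(n_cycles):
        # EX/MEM as written by EX at the end of the previous cycle
        taken, target = ex_mem["cond"], ex_mem["target"]
        instr = program[pc]                   # IF happens in every cycle
        fetched.append(instr[0])
        if taken:
            # The three younger instructions are on the wrong path. The one
            # now in EX must not set Zero?, even if it is itself a branch.
            ex_mem = {"cond": 0, "target": None}
            id_ex = if_id = None
            pc = target
        else:
            in_ex = id_ex                     # instruction executing this cycle
            is_branch = in_ex is not None and in_ex[1] is not None
            ex_mem = {"cond": int(is_branch),
                      "target": in_ex[1] if is_branch else None}
            id_ex, if_id = if_id, instr       # advance the pipeline
            pc = pc + 4
    return fetched

program = {0: ("A", 40), 4: ("B", None), 8: ("C", None),
           12: ("D", None), 40: ("X", None)}
print(fetch_trace(program, 0, 5))   # ['A', 'B', 'C', 'D', 'X']
```

B, C and D are all fetched before X, which matches the 3-cycle penalty described above.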

Today's News: October 1

Exam 1 will be on Thursday.
You can find the figure sheet for the exam here.
You will not need to use the bottom figure which
shows the hardware needed to eliminate branch stalls.
Recitation this week has many sample problems,
most of which will not be covered in the recitation.
Solutions for these problems will be available this afternoon.

Examples: The timing diagram looks like this:

instruction	cycle i	cycle i+1	cycle i+2	cycle i+3	cycle i+4	cycle i+5	cycle i+6	cycle i+7	cycle i+8
instruction A (taken branch)	IF	ID	EX	MEM	WB
instruction B		IF	ID	EX	MEM	WB
instruction C			IF	ID	EX	MEM	WB
instruction D				IF	ID	EX	MEM	WB
instruction X (branch destination)					IF	ID	EX	MEM	WB

The PC is changed at the end of each cycle and is either PC+4 or the ALU output, depending on what is in the EX/MEM register, which was written by the EX stage in the previous cycle.
The branch instruction sets the PC in cycle i+3 (its MEM cycle), so it affects the fetch in cycle i+4.
We inhibit the MEM and WB actions in the next 3 instructions so the effect is that these are not executed.
This produces 3 stalls for each taken branch.

ClassQue: Pipeline Branch

Reducing the branch penalty

Figure C.28 shows how to reduce the taken-branch penalty from 3 cycles to 1. Figures C.22 and C.28 compared

Must know if branch is taken in ID, rather than EX
- Zero? is done in ID rather than EX
- This is easy if we only have branch on zero or nonzero
- Requires more hardware if the branch compares 2 registers
Must compute branch address in ID
- requires an adder in ID after the register file read
- might increase the clock cycle time, but result is not fed into ID/EX
Must feed results of Add and Zero? directly into PC mux rather than into ID/EX to save one cycle

The timing diagram now looks like this:

instruction	cycle i	cycle i+1	cycle i+2	cycle i+3	cycle i+4	cycle i+5	cycle i+6
instruction A (taken branch)	IF	ID	EX	MEM	WB
instruction B		IF	ID	EX	MEM	WB
instruction X (branch destination)			IF	ID	EX	MEM	WB
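
As a rough illustration, both timing diagrams can be generated from the penalty alone. The helper `branch_diagram` below is a hypothetical name for this sketch; it assumes every instruction occupies one stage per cycle and that `penalty` wrong-path instructions are fetched between the branch and its destination.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def branch_diagram(penalty):
    """Rows of a timing diagram for taken branch A and its destination X.

    Each row is [name, cell for cycle i, cell for cycle i+1, ...].
    penalty = 3 corresponds to Figure C.22 and penalty = 1 to Figure C.28.
    """
    wrong_path = [f"instruction {chr(ord('B') + k)}" for k in range(penalty)]
    names = (["instruction A (taken branch)"]
             + wrong_path
             + ["instruction X (branch destination)"])
    rows = []
    for start, name in enumerate(names):
        # an instruction fetched in cycle i+start is shifted right by start cells
        rows.append([name] + [""] * start + STAGES)
    return rows

for penalty in (3, 1):
    print(f"penalty = {penalty}")
    for row in branch_diagram(penalty):
        print("\t".join(row))
```

With a penalty of 3 the destination is fetched in cycle i+4; with a penalty of 1 it is fetched in cycle i+2.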

Questions:

Why do we not strike out the ID and EX of instruction B?
Answer:
We do not have to since they do not change the external state.
Why don't we strike out the MEM and WB for instruction A?
Answer:
A branch instruction does not do anything in these stages.

Dealing with data hazards

Recall that there are 3 types of hazards: structural, data, and control.
Structural hazards will not occur because we included enough hardware.
The above discussion showed how to handle control hazards.
When a data hazard occurs, we need to either stall the pipeline or eliminate the hazard by using forwarding.

ClassQue: forwarding hardware

Examples:

The following requires a stall of the DADD instruction:
```
     LD    R1, 45(R2)
     DADD  R5, R1, R7
```
- This can be detected in the ID stage of the DADD instruction by comparing rt of the LD instruction to rs and rt of the DADD instruction.
- During the ID stage of DADD, rs is in IF/ID.IR.rs and rt is in IF/ID.IR.rt
- During the ID stage of DADD, rt of LD is in ID/EX.IR.rt
The following data hazard in the DSUB instruction can be removed by forwarding:
```
     LD    R1, 45(R2)
     DADD  R5, R6, R7
     DSUB  R8, R1, R7
```
- This can be detected in the EX stage of the DSUB by comparing the rt of the LD to the rs or rt of DSUB
- In this case in the EX stage of DSUB, the ALU must be fed not from the ID/EX register but from the load result in MEM/WB.
- Figure C.27 shows the new data paths needed and the new muxes for the ALU.
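
The two comparisons described in these examples can be sketched in Python. This is a simplified illustration, not the hardware of Figure C.27: the function names and the dictionary encoding of IR are assumptions, only a load is considered as the producer, and the consumer is assumed to read both rs and rt.

```python
def needs_load_stall(id_ex_ir, if_id_ir):
    """Check made in the ID stage of the instruction held in IF/ID.

    Stall if the instruction just ahead (in ID/EX) is a load whose
    destination rt matches either source register of this instruction.
    """
    return (id_ex_ir["op"] == "LD"
            and id_ex_ir["rt"] in (if_id_ir["rs"], if_id_ir["rt"]))

def forward_from_mem_wb(mem_wb_ir, id_ex_ir):
    """Check made in the EX stage of the instruction held in ID/EX.

    Returns the set of ALU source fields ("rs", "rt") that must be fed
    from the load result in MEM/WB instead of from ID/EX.
    """
    if mem_wb_ir["op"] != "LD":
        return set()
    return {src for src in ("rs", "rt") if id_ex_ir[src] == mem_wb_ir["rt"]}

# LD R1, 45(R2): base register rs = 2, destination rt = 1
ld = {"op": "LD", "rs": 2, "rt": 1}
dadd_uses_r1 = {"op": "DADD", "rd": 5, "rs": 1, "rt": 7}   # DADD R5, R1, R7
dsub = {"op": "DSUB", "rd": 8, "rs": 1, "rt": 7}           # DSUB R8, R1, R7

print(needs_load_stall(ld, dadd_uses_r1))   # True
print(forward_from_mem_wb(ld, dsub))        # {'rs'}
```

In the second example the LD is two instructions ahead of the DSUB, so its result is already in MEM/WB when the DSUB reaches EX and no stall is needed.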
