CS 3853 Computer Architecture Notes on Appendix C Section 3
Read Appendix C.3
C3: Pipeline Implementation
We start with a simple unpipelined implementation of a subset of the MIPS instructions.
Unpipelined Implementation
We consider the following 5 types of instructions:
- register-register ALU (result in another register)
- register-immediate ALU (result in another register)
- load (register with displacement addressing)
- store (register with displacement addressing)
- conditional branch instruction that branches if a register is 0
The following information is from Figure A-22.
All instructions are 32 bits and these instructions have one of 2 formats:
I-type:

Used for:
- load: Regs[rt] ← Mem[Regs[rs] + Imm]
- store: Mem[Regs[rs] + Imm] ← Regs[rt]
- branch: if (Regs[rs] == 0) PC ← PC + (Imm << 2)
- Immediate ALU: Regs[rt] ← Regs[rs] op Imm
R-type:

Used for:
- RR ALU: Regs[rd] ← Regs[rs] funct Regs[rt]
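The field layout of the two formats can be sketched in Python. This is a minimal sketch that assumes the standard MIPS bit positions (opcode in bits 31-26, rs in 25-21, rt in 20-16, rd in 15-11, shamt in 10-6, funct in 5-0, 16-bit immediate in 15-0); the function name `decode` and the opcode value used in the example are illustrative only and do not come from these notes.

```python
def decode(word):
    """Split a 32-bit instruction word into the I-type and R-type fields."""
    imm16 = word & 0xFFFF
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,     # R-type only
        "shamt": (word >> 6) & 0x1F,   # R-type only
        "funct": word & 0x3F,          # R-type only
        # I-type only: sign-extend the 16-bit immediate
        "imm": imm16 - 0x10000 if imm16 & 0x8000 else imm16,
    }

# Illustrative I-type word: arbitrary opcode 0x23, rs=2, rt=1, Imm=45
word = (0x23 << 26) | (2 << 21) | (1 << 16) | 45
fields = decode(word)
```

The same function decodes both formats; which fields are meaningful depends on the opcode.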
Today's News: September 24
I will have shortened office hours on Thursday from 1pm until 1:40pm
I will also be available before 11:15am by appointment.
Examples:
Figure C.21 shows the hardware needed to implement these instructions in 5 or fewer cycles.
ClassQue: Figure C.21 logic types
Here is what happens at each cycle:
- IF
- IR ← Mem[PC]
- NPC ← PC + 4
- ID
- A ← Regs[rs]
- B ← Regs[rt]
- Imm ← sign-extended immediate field of IR
- EX
- Load or Store:
ALUOutput ← A + Imm
- RR ALU:
ALUOutput ← A funct B
- R-Imm ALU
ALUOutput ← A op Imm
- Branch:
ALUOutput ← NPC + (Imm << 2)
Cond ← (A == 0)
ClassQue: Figure C.21 Muxes A and B
- MEM
- if Branch and cond PC ← ALUOutput
otherwise PC ← PC + 4
- Load:
LMD ← Mem[ALUOutput]
- Store:
Mem[ALUOutput] ← B
- WB
- Load:
Regs[rt] ← LMD
- RR ALU:
Regs[rd] ← ALUOutput
- R-Imm ALU:
Regs[rt] ← ALUOutput
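The five steps above can be traced with a small Python sketch. It is not the hardware of Figure C.21, only the register transfers. Assumptions: the instruction is passed in already decoded as a dict, the string `kind` stands in for opcode decoding, registers are a list, and memory is a dict; all of these names are hypothetical.

```python
import operator

def execute(kind, instr, regs, mem, pc):
    """Run one instruction through IF, ID, EX, MEM, WB and return the new PC."""
    # IF: the instruction is passed in rather than fetched from Mem[PC]
    npc = pc + 4
    # ID: read both registers and the immediate whether needed or not
    a = regs[instr["rs"]]
    b = regs[instr.get("rt", 0)]
    imm = instr.get("imm", 0)
    # EX
    cond = False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "rr":
        alu_output = instr["op"](a, b)
    elif kind == "ri":
        alu_output = instr["op"](a, imm)
    elif kind == "branch":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError("unknown instruction kind: " + kind)
    # MEM: update the PC, then access memory for loads and stores
    new_pc = alu_output if (kind == "branch" and cond) else npc
    if kind == "load":
        lmd = mem[alu_output]
    elif kind == "store":
        mem[alu_output] = b
    # WB
    if kind == "load":
        regs[instr["rt"]] = lmd
    elif kind == "rr":
        regs[instr["rd"]] = alu_output
    elif kind == "ri":
        regs[instr["rt"]] = alu_output
    return new_pc

regs = [0] * 32
regs[2] = 100
mem = {145: 7}
pc = execute("load", {"rs": 2, "rt": 1, "imm": 45}, regs, mem, 0)
pc = execute("rr", {"rs": 1, "rt": 1, "rd": 5, "op": operator.add}, regs, mem, pc)
pc = execute("branch", {"rs": 0, "imm": 3}, regs, mem, pc)
```

In this run the load puts 7 in register 1, the add puts 14 in register 5, and the branch is taken because register 0 holds 0.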
Question:
ClassQue: Figure C.21 Instruction Encoding
The RR instruction is described as:
RR ALU: Regs[rd] ← Regs[rs] funct Regs[rt]
What would have to change if instead it were:
RR ALU: Regs[rs] ← Regs[rt] funct Regs[rd]
Answer:
?
Pipelined Implementation
Figure C.22 shows a corresponding pipeline implementation.
The registers NPC, IR, A, B, Imm, Cond, ALUOutput and LMD are now contained in the pipeline registers.
Examples:
- NPC is contained in which pipeline register?
Answer:
NPC is created in IF so it is stored in the IF/ID register.
It is needed in EX, so it must also be in ID/EX.
- IR is stored in which pipeline registers?
Answer:
Parts of the IR register are needed in each cycle, so for simplicity, the entire IR is propagated to each pipeline
register. This is somewhat inefficient.
Today's News: September 26
Exam 1 will be one week from today.
Examples:
Figure C.23 shows the details of the pipelined execution for each type of instruction.
Below is a comparison for the RR ALU instruction. See Figures C.21 and C.22.
Operations that are performed, but not needed for this instruction are shown this way:
operation.
Stage | Unpipelined | Pipelined |
IF | IR ← Mem[PC]; NPC ← PC + 4 | IF/ID.IR ← Mem[PC]; PC ← PC + 4; IF/ID.NPC ← PC + 4 |
ID | A ← Regs[IR.rs]; B ← Regs[IR.rt]; Imm ← sign-extended(IR.Immediate) | ID/EX.A ← Regs[IF/ID.IR.rs]; ID/EX.B ← Regs[IF/ID.IR.rt]; ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← sign-extended(IF/ID.IR.Immediate) |
EX | ALUOutput ← A funct B | EX/MEM.IR ← ID/EX.IR; EX/MEM.ALUOutput ← ID/EX.A funct ID/EX.B |
MEM | PC ← PC + 4 | MEM/WB.IR ← EX/MEM.IR; MEM/WB.ALUOutput ← EX/MEM.ALUOutput |
WB | Regs[IR.rd] ← ALUOutput | Regs[MEM/WB.IR.rd] ← MEM/WB.ALUOutput |
Note: My notation is slightly different from that of the book.
For the unpipelined case I use IR.rs instead of just rs, etc.
For the pipelined case I use XX/XX.IR.rs instead of XX/XX.IR[rs]
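The pipelined column for the RR ALU instruction can be followed in a short Python sketch in which each pipeline register is a dict and each stage is a function from one pipeline register to the next. Assumptions: the IR is stored as an already-decoded dict, `funct` is a Python function, and the stage function names are made up for illustration.

```python
import operator

def if_stage(pc, imem):
    """Return the IF/ID register and the updated PC."""
    return {"IR": imem[pc], "NPC": pc + 4}, pc + 4

def id_stage(if_id, regs):
    """Return the ID/EX register."""
    ir = if_id["IR"]
    return {
        "IR": ir,
        "NPC": if_id["NPC"],
        "A": regs[ir["rs"]],
        "B": regs[ir["rt"]],
        "Imm": ir.get("imm", 0),  # computed even though RR ALU ignores it
    }

def ex_stage(id_ex):
    """Return the EX/MEM register for an RR ALU instruction."""
    ir = id_ex["IR"]
    return {"IR": ir, "ALUOutput": ir["funct"](id_ex["A"], id_ex["B"])}

def mem_stage(ex_mem):
    """Return the MEM/WB register; an RR ALU instruction only passes values on."""
    return {"IR": ex_mem["IR"], "ALUOutput": ex_mem["ALUOutput"]}

def wb_stage(mem_wb, regs):
    """Write the result to Regs[MEM/WB.IR.rd]."""
    regs[mem_wb["IR"]["rd"]] = mem_wb["ALUOutput"]

regs = [0] * 32
regs[6], regs[7] = 10, 32
imem = {0: {"rs": 6, "rt": 7, "rd": 5, "funct": operator.add}}

if_id, pc = if_stage(0, imem)
id_ex = id_stage(if_id, regs)
ex_mem = ex_stage(id_ex)
mem_wb = mem_stage(ex_mem)
wb_stage(mem_wb, regs)
```

Note how the IR is copied into every pipeline register so that rd is still available in WB.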
ClassQue: Exam 1 Sample Problems 1
ClassQue: Exam 1 Sample Problems 2
How Branches Work
Branches are hard.
- We already know that branches can cause stalls.
- The problem is that we might not know the branch address or if the branch is taken
until one or more additional instructions have been fetched, and possibly executed.
- We are saved by the fact that the external state (what programs see) is not changed until MEM or WB.
In the unpipelined architecture shown in Figure C.21:
- the NPC stores the potential new PC during IF
- the branch address and whether the branch is taken are computed in EX
- the PC is updated in MEM
- for a branch, the instruction is complete after the MEM cycle.
The pipelined architecture shown in Figure C.22 has a 3-cycle stall when a branch is taken.
Suppose the instruction stream looks like:
instruction (not branch)
instruction (not branch)
instruction (not branch)
instruction A: taken branch
instruction B
instruction C
instruction D
...
instruction X: branch target
The PC is set at the end of IF: normally to PC+4, but if the Zero? field of EX/MEM is not 0, it is set to the ALU result instead.
The Zero? field of EX/MEM stays 0 until the branch instruction is executed.
If the branch instruction is fetched in cycle i:
- cycle i:
- IF of A: taken branch is fetched
- IF of A: branch instruction stored in IF/ID
- IF of A: PC + 4 stored in PC (address of instruction i+1)
- cycle i+1:
- IF of B: instruction B at i+1 is fetched
- IF of B: PC + 4 is stored in PC (address of instruction i+2)
- ID of A: branch base register stored in ID/EX
- ID of A: branch destination offset is stored in ID/EX
- ID of A: branch instruction is stored in ID/EX (from IF/ID)
- cycle i+2:
- IF of C: instruction C at i+2 is fetched
- IF of C: PC + 4 is stored in PC (address of instruction i+3)
- ID of B: instruction B at i+1 is decoded
- EX of A: branch instruction Zero? stored in EX/MEM (this is 1 since the branch is taken)
- EX of A: branch destination stored in EX/MEM
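- cycle i+3:
- IF of D: instruction D at i+3 is fetched
- MEM of A: Zero? in EX/MEM is 1, so the branch destination (not PC + 4) is stored in PC
- cycle i+4:
- IF of X: branch destination is fetched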
Today's News: October 1
Exam 1 will be on Thursday.
You can find the figure sheet for the exam
here.
You will not need to use the bottom figure which
shows the hardware needed to eliminate branch stalls.
Recitation this week
has many sample problems,
most of which will not be covered in the recitation.
Solutions for these problems will be available this afternoon.
Examples:
The timing diagram looks like this:
instruction | cycle i | cycle i+1 | cycle i+2 | cycle i+3 | cycle i+4 | cycle i+5 | cycle i+6 | cycle i+7 | cycle i+8 |
instruction A (taken branch) | IF | ID | EX | MEM | WB | | | | |
instruction B | | IF | ID | EX | MEM | WB | | | |
instruction C | | | IF | ID | EX | MEM | WB | | |
instruction D | | | | IF | ID | EX | MEM | WB | |
instruction X (branch destination) | | | | | IF | ID | EX | MEM | WB |
- The PC is changed at the end of each cycle to either PC+4 or the ALU output, depending on the Zero? field of the EX/MEM register, which was set at the end of the previous cycle.
- The branch instruction's values reach EX/MEM at the end of cycle i+2, so they select the PC at the end of cycle i+3 (the branch's MEM cycle) and affect the fetch in cycle i+4.
- We inhibit the MEM and WB actions in the next 3 instructions so the effect is that these are not executed.
- This produces 3 stalls for each taken branch.
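The cycle counting above can be checked with a small Python sketch. It assumes only what the list says: the PC is redirected at the end of the cycle in which the branch occupies some stage (MEM for Figure C.22), and the destination is fetched in the following cycle. The function names and the stage parameter are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def target_fetch_offset(redirect_stage):
    """Cycles after the branch's IF at which the branch destination is fetched.

    The PC is redirected at the end of the cycle in which the branch is in
    redirect_stage, so the destination's IF happens one cycle later.
    """
    return STAGES.index(redirect_stage) + 1

def branch_penalty(redirect_stage):
    """Number of instructions fetched after a taken branch and then discarded."""
    return target_fetch_offset(redirect_stage) - 1

def timing_rows(redirect_stage, names="BCD"):
    """(instruction name, offset of its IF from cycle i) for each row of the diagram."""
    penalty = branch_penalty(redirect_stage)
    rows = [("A (taken branch)", 0)]
    rows += [(names[k], k + 1) for k in range(penalty)]
    rows.append(("X (branch destination)", penalty + 1))
    return rows

rows = timing_rows("MEM")
```

For the Figure C.22 pipeline this reproduces the diagram: B, C and D are fetched in cycles i+1 to i+3 and X in cycle i+4.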
ClassQue: Pipeline Branch
Reducing the branch penalty
Figure C.28 shows how to reduce the branch-taken penalty from 3 cycles to 1.
Figures C.22 and C.28 compared
- Must know if branch is taken in ID, rather than EX
- Zero? is done in ID rather than EX
- This is easy if we only have branch on zero or nonzero
- Requires more hardware if a branch compares 2 registers
- Must compute branch address in ID
- requires an adder in ID after the register file read
- might increase the clock cycle time, but result is not fed into ID/EX
- Must feed results of Add and Zero? directly into PC mux rather than into ID/EX to save one cycle
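The changes listed above amount to doing the branch's EX work in ID. A rough Python sketch follows; it assumes an already-decoded IR dict with a hypothetical `is_branch` flag, and it models only the Zero? test, the extra adder, and the PC mux, not the rest of Figure C.28.

```python
def id_stage_branch(if_id, regs):
    """Compute the branch decision and destination during ID."""
    ir = if_id["IR"]
    taken = ir["is_branch"] and regs[ir["rs"]] == 0   # Zero? test moved into ID
    target = if_id["NPC"] + (ir["imm"] << 2)          # extra adder in ID
    return taken, target

def next_pc(pc, taken, target):
    """PC mux: fed directly from ID, not from a pipeline register."""
    return target if taken else pc + 4

regs = [0] * 32
if_id = {"IR": {"is_branch": True, "rs": 4, "imm": 5}, "NPC": 104}
taken, target = id_stage_branch(if_id, regs)
pc_after = next_pc(104, taken, target)
```

Here the branch sits at address 100 and is in ID while the instruction at 104 is being fetched, so only that one instruction is fetched before the PC is redirected.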
The timing diagram now looks like this:
instruction | cycle i | cycle i+1 | cycle i+2 | cycle i+3 | cycle i+4 | cycle i+5 | cycle i+6 |
instruction A (taken branch) | IF | ID | EX | MEM | WB | | |
instruction B | | IF | ID | EX | MEM | WB | |
instruction X (branch destination) | | | IF | ID | EX | MEM | WB |
Questions:
- Why do we not strike out the ID and EX of instruction B?
Answer:
We do not have to since they do not change the external state.
- Why don't we strike out the MEM and WB for instruction A?
Answer:
A branch instruction does not do anything in these stages.
Dealing with data hazards
Recall that there are 3 types of hazards: structural, data, and control.
Structural hazards will not occur because we included enough hardware.
The above discussion showed how to handle control hazards.
When a data hazard occurs, we need to either stall the pipeline or eliminate the hazard by using forwarding.
ClassQue: forwarding hardware
Examples:
- The following requires a stall of the DADD instruction:
LD R1, 45(R2)
DADD R5, R1, R7
- This can be detected in the ID stage of the DADD instruction by comparing rt
of the LD instruction to rs and rt of the DADD instruction.
- During the ID stage of DADD, rs is in IF/ID.IR.rs and rt is in IF/ID.IR.rt
- During the ID stage of DADD, rt of LD is in ID/EX.IR.rt
- The following data hazard in the DSUB instruction can be removed by forwarding:
LD R1, 45(R2)
DADD R5, R6, R7
DSUB R8, R1, R7
- This can be detected in the EX stage of the DSUB by comparing the rt of the LD to the rs or rt of DSUB
- In this case in the EX stage of DSUB, the ALU must be fed not from the ID/EX register but from the load result in MEM/WB.
- Figure C.27 shows the new data paths needed and the new muxes for the ALU.
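Both checks in the examples above are register-number comparisons between pipeline registers, which can be sketched in Python. Assumptions: the IR is an already-decoded dict with a hypothetical `is_load` flag, the instruction that reads the register is register-register (so both rs and rt are sources), and only the load-from-MEM/WB forwarding path from the second example is modeled; Figure C.27 contains more paths than this.

```python
def load_use_stall(if_id_ir, id_ex_ir):
    """True if the instruction in ID must stall behind a load that is in EX."""
    if not id_ex_ir["is_load"]:
        return False
    # rt of the load is its destination; compare it with both sources
    return id_ex_ir["rt"] in (if_id_ir["rs"], if_id_ir["rt"])

def forward_from_mem_wb(id_ex, mem_wb):
    """Pick the ALU operands in EX, taking a load result from MEM/WB if needed."""
    a, b = id_ex["A"], id_ex["B"]
    ir, wb_ir = id_ex["IR"], mem_wb["IR"]
    if wb_ir["is_load"]:
        if wb_ir["rt"] == ir["rs"]:
            a = mem_wb["LMD"]
        if wb_ir["rt"] == ir["rt"]:
            b = mem_wb["LMD"]
    return a, b

LD = {"is_load": True, "rs": 2, "rt": 1}                   # LD R1, 45(R2)
DADD_R1 = {"is_load": False, "rs": 1, "rt": 7, "rd": 5}    # DADD R5, R1, R7
DADD_R6 = {"is_load": False, "rs": 6, "rt": 7, "rd": 5}    # DADD R5, R6, R7
DSUB = {"is_load": False, "rs": 1, "rt": 7, "rd": 8}       # DSUB R8, R1, R7

stall = load_use_stall(DADD_R1, LD)
no_stall = load_use_stall(DADD_R6, LD)
# A holds the stale value of R1 read in ID; LMD holds the loaded value
operands = forward_from_mem_wb({"IR": DSUB, "A": 0, "B": 3}, {"IR": LD, "LMD": 99})
```

In the first example the DADD directly after the LD is flagged for a stall; in the second, the DSUB two instructions later gets its first operand from MEM/WB.LMD instead of ID/EX.A.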