Today's News: September 10
Recitations will start this week.
Read Appendix C.1
C.1: Introduction
Consider a traditional processor in which instructions are executed as follows:
- Instruction fetch (IF): read the instruction from memory and update the PC
- Instruction decode (ID): decode the instruction and read the source registers
- Execution (EX): execute, e.g. perform ALU operation (may be effective address calculation)
- Memory access (MEM): If this is a load, read from memory; if a store, write to memory
- Write-back (WB): Write the result to the destination register
If we execute one instruction per cycle, the cycle time needs to be long enough to perform all of these steps on the longest
instruction.
Alternatively, we can execute each step in its own cycle, which makes the cycle time shorter, but some instructions will require
as many as 5 cycles.
The cycle time now only needs to be long enough for the slowest of these steps.
A 5-cycle instruction will execute more slowly than before, but some instructions will take fewer cycles.
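A minimal Python sketch of this trade-off, using made-up per-step latencies (the numbers below are illustrative assumptions, not figures from the text):

    # Hypothetical latencies for the 5 steps, in nanoseconds (assumed for illustration).
    step_ns = {"IF": 0.2, "ID": 0.1, "EX": 0.2, "MEM": 0.2, "WB": 0.1}

    # One instruction per cycle: the cycle must be long enough for all steps.
    single_cycle = sum(step_ns.values())        # 0.8 ns per instruction

    # One step per cycle: the cycle only needs to fit the slowest step,
    # but a long instruction now needs 5 of these cycles.
    multi_cycle = max(step_ns.values())         # 0.2 ns per cycle
    print(single_cycle, 5 * multi_cycle)        # 0.8 ns vs 1.0 ns for a 5-cycle instruction

With these assumed latencies, a 5-cycle instruction is slower than before, but an instruction that needs fewer steps finishes sooner.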
ClassQue: Pipeline Timing 1
We will consider this second approach for now.
What fraction of the time is the ALU being used?
How can we improve the performance?
The idea of pipelining: Fetch the next instruction while decoding the previous instruction.
Instead of a throughput of 1 instruction every 5 cycles, we could get one per cycle after an initial delay.
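Under the usual no-stall assumption, the cycle counts can be sketched as follows (the function names are ours, used only for illustration):

    # Cycles needed to finish n instructions on a 5-stage machine, assuming no stalls.
    def cycles_unpipelined(n, stages=5):
        return n * stages                 # each instruction finishes before the next starts

    def cycles_pipelined(n, stages=5):
        return stages + (n - 1)           # fill the pipeline once, then 1 instruction per cycle

    for n in (1, 10, 1000):
        print(n, cycles_unpipelined(n), cycles_pipelined(n))
    # 1 -> 5 vs 5;  10 -> 50 vs 14;  1000 -> 5000 vs 1004

As n grows, the pipelined throughput approaches one instruction per cycle.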
Important:
- You must know the above 5 steps
- You must be able to give them in order using the 2 or 3 letter description:
IF, ID, EX, MEM, WB
- You must know the names of each step:
Instruction fetch, Instruction decode, Execution, Memory access, Write-back
- You must be able to describe in general what each step does
- For each of the following major types of 4-byte RISC instructions,
you must be able to describe in detail what happens at each step:
- register-register ALU (result in another register)
- register-immediate ALU (result in another register)
- load (register with displacement addressing)
- store (register with displacement addressing)
- conditional branch instruction that compares 2 registers
See pages C-5 and C-6 of the text.
The MIPS instruction set
- We will base our pipeline discussion on the MIPS 64-bit instruction set.
- This is a RISC instruction set.
- All operations on data apply to data in registers and typically change the entire register.
- Only load and store instructions access memory.
- Memory instructions can operate on 8, 16, 32, or 64 bits.
- Almost all instructions are 32 bits in length.
- 32 general purpose registers with R0 always 0.
- ALU instructions have 3 operands, either all registers, or 2 registers and an immediate value:
- DADD R1, R2, R3: R1 = R2 + R3
- DADDIU R1, R2, #3: R1 = R2 + 3 (unsigned)
- Load and store instructions use base register with displacement addressing.
- LD R1, 30(R2): load 64 bits of memory at address R2 + 30 into R1
- SD R1, 30(R2): store the 64 bits in R1 to memory at address R2 + 30
- Note that the source or destination register comes first and the memory address second.
- Branches and Jumps: branches are conditional and the instruction stores the offset from the current PC
MIPS can use either condition codes or direct register comparison. We only consider the latter.
- BEQZ R1, name: branch if R1 is 0
- BNE R1, R2, name: branch if R1 is not equal to R2
- For now we do not need to know the details, which are contained in Appendix A.
- All of these instructions can be executed in 5 cycles or fewer using IF, ID, EX, MEM, WB
- Branch instructions require only 2 cycles.
- Store instructions require only 4 cycles.
- All other instructions take 5 cycles.
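To make the semantics of these instructions concrete, here is a minimal Python sketch that interprets the examples above on a register file and a small memory. The instruction names come from the list; the data structures and helper names are ours, and the sketch ignores instruction encodings, PC updates, 64-bit overflow, and exceptions.

    # Behavioral sketch only (not a real simulator): 32 registers, sparse 64-bit memory words.
    regs = [0] * 32                      # R0 is hard-wired to 0
    mem = {}

    def dadd(rd, rs, rt):                # DADD Rd, Rs, Rt : Rd = Rs + Rt
        if rd != 0:
            regs[rd] = regs[rs] + regs[rt]

    def daddiu(rd, rs, imm):             # DADDIU Rd, Rs, #imm : Rd = Rs + imm
        if rd != 0:
            regs[rd] = regs[rs] + imm

    def ld(rd, disp, base):              # LD Rd, disp(Rbase) : load 64 bits from Rbase + disp
        if rd != 0:
            regs[rd] = mem.get(regs[base] + disp, 0)

    def sd(rs, disp, base):              # SD Rs, disp(Rbase) : store 64 bits to Rbase + disp
        mem[regs[base] + disp] = regs[rs]

    def beqz(rs):                        # BEQZ Rs, name : taken if Rs == 0
        return regs[rs] == 0

    def bne(rs, rt):                     # BNE Rs, Rt, name : taken if Rs != Rt
        return regs[rs] != regs[rt]

    regs[2], regs[3] = 40, 2
    dadd(1, 2, 3)                        # DADD R1, R2, R3   -> R1 = 42
    daddiu(1, 2, 3)                      # DADDIU R1, R2, #3 -> R1 = 43
    sd(1, 30, 2)                         # SD R1, 30(R2)     -> mem[70] = 43
    ld(4, 30, 2)                         # LD R4, 30(R2)     -> R4 = 43
    print(regs[1], regs[4], beqz(0), bne(1, 4))   # 43 43 True False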
Example
Describe the execution of
DADDIU R1, R2, #3
at each of the 5 execution stages.
Solution:
- IF:
- Send the PC to the instruction memory and fetch the next instruction.
- Add 4 to the PC (length of the instruction)
- ID:
- decode the instruction
- get R2 from the register file
- sign-extend the immediate value from the instruction
- EX:
- send the value of R2 and the sign-extended immediate value to the ALU to perform the add
- MEM:
- nothing happens in this stage: an ALU instruction does not access memory (the ALU result is simply passed along)
- WB:
- write the result to R1 in the register file
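The sign-extension step in ID can be illustrated with a short sketch; the 16-bit immediate width is the standard MIPS field size, and the helper name is ours:

    def sign_extend(imm16, bits=16):
        # Interpret the low `bits` bits as a two's-complement number.
        sign_bit = 1 << (bits - 1)
        return (imm16 & (sign_bit - 1)) - (imm16 & sign_bit)

    print(sign_extend(0x0003))    #  3  -> the immediate of DADDIU R1, R2, #3
    print(sign_extend(0xFFFD))    # -3  -> a negative immediate extends correctly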
Problems ClassQue: 5 stage pipeline 1
- Describe the execution of
LD R1, 30(R2)
in the EX stage.
- In the ID step, we decode the instruction and read the source registers.
Describe in words what is meant by decode the instruction.
The classic 5-stage pipeline
The simple 5-stage pipeline looks like this:
                   |                    clock number
Instruction number |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |
Instruction i      | IF  | ID  | EX  | MEM | WB  |     |     |     |     |
Instruction i+1    |     | IF  | ID  | EX  | MEM | WB  |     |     |     |
Instruction i+2    |     |     | IF  | ID  | EX  | MEM | WB  |     |     |
Instruction i+3    |     |     |     | IF  | ID  | EX  | MEM | WB  |     |
Instruction i+4    |     |     |     |     | IF  | ID  | EX  | MEM | WB  |
Starting with clock number 5, one instruction completes per clock cycle.
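A small Python sketch that prints this kind of diagram for a run of instructions (assuming no stalls; the function is ours, not from the text):

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def pipeline_diagram(n_instructions):
        n_cycles = n_instructions + len(STAGES) - 1
        for i in range(n_instructions):
            row = ["    "] * n_cycles
            for s, name in enumerate(STAGES):
                row[i + s] = f"{name:<4}"    # instruction i occupies stage s during cycle i + s + 1
            print(f"Instruction i+{i}: " + " ".join(row))

    pipeline_diagram(5)    # reproduces the table above: from cycle 5 on, one instruction finishes per cycle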
Figure C.2
shows the hardware needed to support each stage of the pipeline.
We need to make sure that a piece of hardware does not need to do 2 things at once.
For example, an ADD will need to use the ALU in stage 3 (the EX stage) and a LD will need to compute an effective address
(by adding a displacement to a register) in the same stage.
However, a branch will need to compute the branch target address
in stage 2 (ID), which requires an adder.
We also need an adder in stage 1 (IF) to increment the PC.
- IF accesses the Instruction Memory (IM), but also needs to increment the program counter (needs an adder)
- ID needs to read from the register file (but not write to it) and possibly compute a branch address.
- EX does an ALU operation or calculates a memory address (not both since this is RISC)
- MEM accesses memory if this is a load or store
- WB: if ALU or load, write to the register file.
Note that the register file is accessed in both ID and WB.
We assume that we can read and write in the same cycle.
In fact, we write at the beginning of the cycle and read at the end of the cycle.
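A sketch of why this ordering matters, modeling the register file as a class (the class and method names are ours): if instruction i writes a register in WB during the same cycle in which instruction i+3 reads it in ID, the read sees the new value because the write happens first.

    class RegisterFile:
        def __init__(self):
            self.regs = [0] * 32

        def write_first_half(self, reg, value):   # WB of the older instruction
            if reg != 0:
                self.regs[reg] = value

        def read_second_half(self, reg):          # ID of the younger instruction, same cycle
            return self.regs[reg]

    rf = RegisterFile()
    rf.write_first_half(1, 42)        # instruction i writes R1 = 42 in WB
    print(rf.read_second_half(1))     # 42 -- instruction i+3 reads the freshly written value in ID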
Today's News: September 12
Recitation 2 is available.
Implementation requirement: pipeline registers
At the end of each stage, certain values need to be saved so that they remain available, unchanged, to later stages.
Example: ALU is combinational logic
- This means that the outputs can change soon after the inputs change
- In Figure C.1, the same ALU is used every clock cycle
- A value computed in stage 3 (EX) might be used in stage 4 (MEM) or stage 5 (WB).
- A register (sequential logic) can hold the results after each clock cycle, until the next one.
Figure C.3
shows the pipeline registers required.
Questions: ClassQue: Figure C.3 ALU and Pipeline
- In Figure C.3, how many different ALUs are shown?
Answer:
- In Figure C.3, how many different pipeline registers are shown?
Answer:
Example
Describe what needs to be stored in each of the pipeline registers during the execution of
DADDIU R1, R2, #3
Solution:
It is easier to do this backwards, starting with the MEM/WB register, to make sure that everything that is needed propagates forward.
Only values from the previous pipeline register and those computed at the current stage are available to be saved.
Look at the previous example describing what is needed at each stage.
- IF/ID: The fetched instruction
- ID/EX: the value of R2, the sign-extended immediate value,
and the address of the R1 register from the IF/ID pipeline register.
- EX/MEM: the result of the ALU operation and the address of the R1 register, both from the ID/EX pipeline register.
- MEM/WB: the result of the ALU operation and the address of the R1 register, both from the EX/MEM pipeline register.
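A minimal sketch of this propagation, with each pipeline register modeled as a Python dictionary (the field names are informal labels, not hardware signal names), assuming R2 currently holds 40:

    IF_ID  = {"instruction": "DADDIU R1, R2, #3"}
    ID_EX  = {"rs_value": 40, "imm": 3, "dest_reg": 1}          # R2 read, #3 sign-extended, dest kept
    EX_MEM = {"alu_result": ID_EX["rs_value"] + ID_EX["imm"],   # 43, computed by the ALU in EX
              "dest_reg": ID_EX["dest_reg"]}
    MEM_WB = {"alu_result": EX_MEM["alu_result"],               # MEM does nothing; just pass the values along
              "dest_reg": EX_MEM["dest_reg"]}
    print(MEM_WB)    # {'alu_result': 43, 'dest_reg': 1} -- WB writes 43 into R1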
Pipeline performance
- Pipelining increases throughput
- Pipeline does not reduce latency
- Usually pipelining increases latency (slightly)
- Clock runs at a rate determined by the slowest stage in the pipeline.
Examples
- An unpipelined machine has a 1 ns clock.
All instructions take 5 cycles, except for branches which take 2 cycles.
Branches are 30% of all instructions.
What is the speedup obtained by using a pipelined design if the pipelining
increases the clock cycle time to 1.5 ns?
Solution:
Unpipelined CPI = .7 × 5 + .3 × 2 = 4.1
Unpipelined average instruction execution time: 1 ns × 4.1 = 4.1 ns.
Pipelined average instruction execution time: 1.5 ns.
Speedup = (average instruction execution time unpipelined) / (average instruction execution time pipelined) = 4.1 / 1.5 ≈ 2.73.
- Why did we ignore the latency of the pipelined machine in the above solution?
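The arithmetic in the speedup example above can be checked with a short sketch:

    unpipelined_cpi = 0.7 * 5 + 0.3 * 2           # 4.1 cycles per instruction
    unpipelined_time = 1.0 * unpipelined_cpi      # 4.1 ns per instruction (1 ns clock)
    pipelined_time = 1.5 * 1                      # 1.5 ns per instruction (CPI of 1, fill latency ignored)
    print(unpipelined_time / pipelined_time)      # about 2.73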