CS 3853 Computer Architecture Notes on Appendix B Section 2

Today's News: October 5, 2015
Assignment 2 is available

Read Appendix B.2

B.2: Cache Performance



Example 1
Compare the miss ratios and access times of:
  1. 16KB instruction cache and 64KB data cache
  2. 256KB unified cache
Make reasonable assumptions to solve the problem.
Solution:
Assumptions:
  • Miss rates per 1000 instructions are given in Figure B.6 (on page B-15) as follows:
    16KB instruction: 3.82
    64KB data: 36.9
    256KB unified: 32.9
    These assume that 36% of instructions are loads and stores, as in some SPEC benchmarks.
    Assume a 2-way set associative cache with 64-byte blocks.
  • A hit takes 1 cycle
  • Miss penalty is 50 cycles
  • A load or store takes an extra cycle because of the structural hazard in the case of the unified cache.
  • Ignore stalls due to write-through.
miss ratio_split = (3.82 + 36.9)/(1.36 × 1000) = .02994
miss ratio_unified = 32.9/(1.36 × 1000) = .02419
The unified miss ratio is better!
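
As a quick check, here is a short Python sketch of the same conversion from misses per 1000 instructions to a miss ratio; the Figure B.6 counts and the 1.36 accesses per instruction are the assumptions listed above, and the variable names are just illustrative.

    # Misses per 1000 instructions (Figure B.6) divided by memory accesses
    # per 1000 instructions (1.36 accesses per instruction, as assumed above).
    accesses_per_1000_instructions = 1.36 * 1000

    misses_split = 3.82 + 36.9    # 16KB instruction cache + 64KB data cache
    misses_unified = 32.9         # 256KB unified cache

    miss_ratio_split = misses_split / accesses_per_1000_instructions
    miss_ratio_unified = misses_unified / accesses_per_1000_instructions

    print(f"{miss_ratio_split:.5f}")    # 0.02994
    print(f"{miss_ratio_unified:.5f}")  # 0.02419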

The miss ratio comparison does not take into account the extra stall due to the structural hazard in the unified cache.
To calculate the average memory access time:
average memory access time = hit time + miss ratio × miss penalty
access time_split = 1 + .02994 × 50 = 2.497 cycles.
access time_unified = 1 + .36 + .02419 × 50 = 2.57 cycles.
The split access time is better!
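
A similar sketch for the average memory access times, charging the unified cache an extra .36 cycles for the load/store structural hazard, exactly as in the calculation above:

    # Average memory access time = hit time + miss ratio * miss penalty,
    # with a 1-cycle hit and a 50-cycle miss penalty as assumed above.
    hit_time = 1
    miss_penalty = 50

    access_time_split = hit_time + 0.02994 * miss_penalty
    # The unified cache pays an extra 0.36 cycles for the structural hazard
    # between instruction fetches and loads/stores.
    access_time_unified = hit_time + 0.36 + 0.02419 * miss_penalty

    print(f"{access_time_split:.3f}")   # 2.497
    print(f"{access_time_unified:.2f}") # 2.57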

The next example explores the performance of direct mapped and set associative caches.
For a given cache size, more associativity generally gives a higher hit ratio.
However, more associativity requires additional hardware (and time) to check tags, even on a hit.
This might require increasing the clock cycle time.
Example 2
Which is faster, a direct mapped cache with a cycle time of .4 ns, or
a 2-way set associative cache with a cycle time of .45 ns?
We need some additional assumptions to do this problem:
  1. 1.3 memory accesses per instruction
  2. CPI of 1 with no cache misses
  3. miss penalty of 21 ns
  4. miss rate of direct mapped cache: 2.3%
  5. miss rate of 2-way set associative cache: 2.1%
  6. these are unified caches, but with no structural hazard
Solution
First, we need to know the miss penalty in cycles for each:
miss penalty_direct = 21 ns / .4 ns = 52.5 cycles
miss penalty_2-way = 21 ns / .45 ns = 46.67 cycles
We round up the number of cycles for the miss penalty.
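
In Python, this rounding is just math.ceil applied to the ratio of the 21 ns penalty to each cycle time (a sketch using the cycle times given in the problem):

    # Miss penalty in whole clock cycles, rounded up as in the notes.
    import math

    miss_penalty_ns = 21
    penalty_cycles_direct = math.ceil(miss_penalty_ns / 0.40)  # ceil(52.5)  = 53
    penalty_cycles_2way = math.ceil(miss_penalty_ns / 0.45)    # ceil(46.67) = 47

    print(penalty_cycles_direct, penalty_cycles_2way)  # 53 47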
Second, we calculate the CPI for each:
CPI_direct = 1 + 1.3 × .023 × 53 = 2.5847
CPI_2-way = 1 + 1.3 × .021 × 47 = 2.2831
What we really want is the time per instruction:
Time per instruction_direct = 2.5847 × .4 ns = 1.0339 ns.
Time per instruction_2-way = 2.2831 × .45 ns = 1.0274 ns.
In this case the 2-way cache is better by .6%.
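
The CPI and time-per-instruction arithmetic as a Python sketch, using the assumptions above (the variable names are just illustrative):

    # CPI = base CPI + accesses per instruction * miss rate * miss penalty (cycles);
    # time per instruction = CPI * cycle time.
    accesses_per_instruction = 1.3

    cpi_direct = 1 + accesses_per_instruction * 0.023 * 53
    cpi_2way = 1 + accesses_per_instruction * 0.021 * 47

    time_direct_ns = cpi_direct * 0.40
    time_2way_ns = cpi_2way * 0.45

    print(f"direct mapped: CPI {cpi_direct:.4f}, {time_direct_ns:.4f} ns/instruction")  # 2.5847, 1.0339
    print(f"2-way:         CPI {cpi_2way:.4f}, {time_2way_ns:.4f} ns/instruction")      # 2.2831, 1.0274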

With out-of-order execution, part of the miss penalty can be overlapped with the execution of other instructions.
Example 3
Redo the above problem if 30% of the miss penalty can be overlapped.
Solution:
We just have to reduce the miss penalty by 30% in each case.
CPI_direct = 1 + 1.3 × .023 × 53 × .7 = 2.1093
CPI_2-way = 1 + 1.3 × .021 × 47 × .7 = 1.8982
What we really want is the time per instruction:
Time per instruction_direct = 2.1093 × .4 ns = .8437 ns.
Time per instruction_2-way = 1.8982 × .45 ns = .8542 ns.
In this case the direct mapped cache is faster by 1.25%.
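
And the same sketch with only 70% of the miss penalty exposed to the processor:

    # 30% of each cache miss penalty is hidden by out-of-order execution,
    # so only 70% of it contributes to the CPI.
    exposed_fraction = 0.7

    cpi_direct = 1 + 1.3 * 0.023 * 53 * exposed_fraction
    cpi_2way = 1 + 1.3 * 0.021 * 47 * exposed_fraction

    print(f"direct mapped: {cpi_direct * 0.40:.4f} ns/instruction")  # 0.8437
    print(f"2-way:         {cpi_2way * 0.45:.4f} ns/instruction")    # 0.8542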
