title | date | lastmod |
---|---|---|
Cache | 2022-11-08 | 2022-11-21 |
Memory in a cache is stored as cache lines.
A CPU accesses the cache through a memory address, which is split into a tag, an index, and a block offset.
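As a concrete sketch, the C snippet below decomposes a 32-bit address into these fields. The field widths are assumptions, corresponding to a cache with 64-byte lines (6 offset bits) and 256 sets (8 index bits):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-byte lines and 256 sets; the remaining
 * high-order bits form the tag. */
#define OFFSET_BITS 6
#define INDEX_BITS  8

int main(void) {
    uint32_t addr = 0x12345678; /* arbitrary example address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* The cache compares 'tag' against the tag stored at set 'index';
     * 'offset' selects the byte within the 64-byte line. */
    printf("tag=0x%" PRIx32 " index=0x%" PRIx32 " offset=0x%" PRIx32 "\n",
           tag, index, offset);
    return 0;
}
```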
Storing these separately allows for better parallelism: the CPU can fetch instructions from the instruction cache while writing to the data cache for STUR instructions.
The key factor affecting cache performance is the cost of cache misses. When a cache miss occurs, cycles are spent finding a victim, requesting the appropriate block from memory, filling the cache line with this new block, and resuming execution.
Compulsory miss: the first reference to a given block of memory. This miss is inevitable, as the data has not been accessed before.
Capacity miss: occurs when the current working set exceeds the cache capacity, forcing currently useful blocks to be replaced.
Conflict miss: occurs when useful blocks are displaced because the placement policy maps multiple blocks to the same location, e.g. a direct-mapped cache. (A fully associative cache has no conflict misses.)
- Number of blocks
    - More blocks give a larger capacity, which results in fewer capacity misses
- Associativity
    - Reduces conflict misses
    - Increases access time
- Block size
    - Larger blocks exploit spatial locality: more data in the same area is loaded together when requested (see the traversal sketch after this list)
    - Reduces compulsory misses, since more data is loaded at once
    - Reduces the number of cache blocks for a fixed cache size. This increases conflict misses, as different addresses are more likely to map to the same block
    - Increases miss penalty, as more data needs to be fetched on a miss
- Levels of cache
    - Adding levels (e.g. L2, L3) reduces the effective miss penalty of the level above
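To make the spatial-locality point concrete, here is a small C sketch (the array size and element type are assumptions). The row-major walk touches memory sequentially, so it keeps reusing each fetched cache line; the column-major walk jumps a full row ahead on every access, so nearly every access pulls in a new line:

```c
#include <stdio.h>

#define N 1024

/* C stores this row-major: grid[i][0..N-1] are contiguous in memory. */
static int grid[N][N];

int main(void) {
    long sum = 0;

    /* Row-major walk: consecutive accesses fall in the same cache line
     * until it is exhausted, so most accesses are hits. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];

    /* Column-major walk: consecutive accesses are N*sizeof(int) bytes
     * apart, so almost every access misses and fetches a new line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];

    printf("%ld\n", sum);
    return 0;
}
```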
An issue arises when different cores have cached values which share the same cache line. The picture below depicts how 2 cores try to access different, independent values (X and Y) that reside on the same cache line in the L3 cache.
If X and Y are heavily used variables:
- Writing to X invalidates the cache line in Core 2
- Writing to Y invalidates the cache line in Core 1
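A minimal pthreads sketch of this false-sharing pattern and a common fix (the struct layout, iteration count, and 64-byte line size are assumptions): two threads increment independent counters, but without padding both counters would sit on one cache line, so each write would invalidate the other core's copy. Padding pushes each counter onto its own line:

```c
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

#define ITERS 10000000L
#define LINE  64 /* assumed cache-line size */

/* Without 'pad', x and y would share one cache line, and every write
 * by one core would invalidate the line in the other core's cache
 * (false sharing). The pad pushes y onto its own line. */
struct counters {
    volatile long x;
    char pad[LINE - sizeof(long)];
    volatile long y;
};

static alignas(LINE) struct counters c;

static void *bump_x(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        c.x++; /* writes touch only x's cache line */
    return NULL;
}

static void *bump_y(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        c.y++; /* writes touch only y's cache line */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_x, NULL);
    pthread_create(&t2, NULL, bump_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%ld y=%ld\n", c.x, c.y);
    return 0;
}
```

Removing `pad` reproduces the problem: both counters land on one line, and the coherence protocol ping-pongs the line between the two cores on every write.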
$$
\begin{align}
&CPU_{time}=(CPU_{\text{execution cycles}}+\text{Memory stall cycles})\times\text{Cycle Time}\\
&CPU_{time}=((IC\times CPI_{ideal})+(IC\times\%\text{Memory Access}\times\text{Miss Rate}\times\text{Miss Penalty}))\times \text{Cycle Time}
\end{align}
$$
An L1 cache hit can be considered part of $CPI_{ideal}$, as the data access can often be completed within the ideal clock cycles.
We need a way to measure the performance of a cache independently of the performance of the CPU.
AMAT is the average time to access memory, accounting for both hits and misses:
$$
\begin{aligned}
AMAT&=\text{Hit Time}\times(1-\text{Miss Rate})+\text{Miss Rate}\times(\text{Hit Time}+\text{Miss Penalty})\\
&=\text{Hit Time}+\text{Miss Rate}\times\text{Miss Penalty}
\end{aligned}
$$
Note here that Miss Penalty is the cycles lost relative to a hit, not simply the cost of a main memory access. E.g. if the time to hit the cache is 1 cycle and the time to hit main memory is 100 cycles, the miss penalty is $100-1=99$ cycles, so a miss still costs 100 cycles in total ($\text{Hit Time}+\text{Miss Penalty}$).
One example showing why AMAT is the better metric: consider two caches with identical miss rates but drastically different hit times. By miss rate alone, we would rate both caches the same. By AMAT, the cache with the lower hit time (or lower miss penalty) outperforms the other, all other variables being equal.
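As a numeric sketch (all figures below are assumed, not from the original): give both caches a 5% miss rate and a 100-cycle miss penalty, but hit times of 1 vs 2 cycles. They tie on miss rate yet separate clearly under AMAT:

$$
\begin{aligned}
AMAT_A&=1+0.05\times100=6\text{ cycles}\\
AMAT_B&=2+0.05\times100=7\text{ cycles}
\end{aligned}
$$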
3-bit index, 2-bit tag, 0-bit offset.
Hit rate: 2/8 = 25%
a.
$$
\begin{align}
&T=IC\times CPI\times Period\\
&CPI_{ideal}=1.25\\
&\text{L1 stall cycles}=0.2\times0.2\times8=0.32\\
&\text{L2 stall cycles}=0.2\times0.2\times0.1\times30=0.12\\
&\text{CPI stall}=1.25+0.32+0.12=1.69
\end{align}
$$
b.
$$
\begin{align}
&\text{L1 stall cycles}=0.2\times0.2\times30=1.2\\
&\text{CPI stall}=1.25+1.2=2.45
\end{align}
$$