Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
those neighboring bytes are the most likely to be the ones it will need to
process next.
Cache blocks form the basic unit of cache organization, and RAM is also
organized into blocks of the same size as the cache’s blocks. When a block is
moved from RAM to the cache, it is placed into a special slot in the cache
called a block frame. Figure 11-5 shows a set of cache blocks stored in RAM and a cache with an empty set of block frames.
[Figure 11-5: Blocks and block frames (RAM blocks 0–19; L1 cache with eight empty block frames, 0–7)]
Cache designers can choose from a few different schemes for governing
which RAM blocks can be stored in which of the cache’s block frames. Such a
scheme is called a cache placement policy, because it dictates where in the cache a block from memory can be placed.
Tag RAM
When the CPU requests a byte from a particular block from RAM, it needs to
be able to determine three things very quickly:
- whether or not the needed block is actually in the cache (i.e., whether there is a cache hit or a cache miss)
- the location of the block within the cache (in the case of a cache hit)
- the location of the desired byte (or critical word) within the block (again, in the case of a cache hit)
A cache accommodates all three needs by associating a special piece
of memory—called a tag—with each block frame in the cache. The tag holds information about the blocks currently being stored in the frame,
and that information allows the CPU to determine the answer to all three
of the questions above. However, the speed with which that answer comes
depends on a variety of factors.
Tags are stored in a special type of memory called the tag RAM. This
memory has to be made of very fast SRAM because it can take some time
to search it in order to locate the desired cache block. The larger the cache,
the greater the number of blocks, and the greater the number of blocks, the
more tag RAM you need to search and the longer it can take to locate the
correct block. Thus the tag RAM can add unwanted access latency to the
cache. As a result, you not only have to use the fastest RAM available for
the tag RAM, but you also have to be smart about how you use tags to map
RAM blocks to block frames. In the following section, I’ll introduce the
three general options for doing such mapping, and I’ll discuss some of the
pros and cons of each option.
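To make the tag's role concrete, here is a minimal sketch in Python (my own illustration, not code from this book) of how a tag check answers the three questions above. The block size and frame structure are made-up example values; a real cache does these comparisons in hardware, in parallel.

```python
# Minimal sketch (illustrative only): a cache as a group of block frames,
# each with a tag identifying which RAM block it currently holds.

BLOCK_SIZE = 64          # bytes per block (a made-up example value)

class BlockFrame:
    def __init__(self):
        self.valid = False   # does this frame currently hold a block?
        self.tag = None      # which RAM block is cached here
        self.data = None     # the block's bytes

def lookup(frames, address):
    """Return (hit, byte) for a byte address by checking candidate frames."""
    block_number = address // BLOCK_SIZE   # which RAM block holds the byte
    offset = address % BLOCK_SIZE          # where the byte sits inside that block
    for frame in frames:                   # compare the tag of each candidate frame
        if frame.valid and frame.tag == block_number:
            return True, frame.data[offset]  # hit: block found, byte located
    return False, None                       # miss: block is not in the cache
```

Which frames count as candidates for a given block is exactly what the placement policies below decide.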
Fully Associative Mapping
The most conceptually simple scheme for mapping RAM blocks to cache block
frames is called fully associative mapping. Under this scheme, any RAM block can be stored in any available block frame. Fully associative mapping is depicted
in Figure 11-6, where any of the red RAM blocks can go into any of the red
cache block frames.
The problem with fully associative mapping is that if you want to retrieve
a specific block from the cache, you have to check the tag of every single block frame in the entire cache because the desired block could be in any of the
frames. Since large caches can have thousands of block frames, this tag
searching can add considerable delay (latency) to a fetch. Furthermore, the
larger the cache, the worse the delay gets, since there are more block frames
and hence more block tags to check on each fetch.
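As a rough sketch of that cost (again my own illustration, not the book's code), a fully associative lookup has to compare the requested block number against every frame's tag before it can declare a miss:

```python
# Sketch: fully associative lookup. Every frame is a candidate, so in the
# worst case every tag must be compared. The cache is modeled as a simple
# list of tags, one entry per block frame (None means the frame is empty).

def fully_associative_lookup(tags, block_number):
    for frame_index, tag in enumerate(tags):
        if tag == block_number:
            return frame_index        # hit: the block is in this frame
    return None                       # miss: we compared every single tag

tags = [0, 8, 16, 3, None, None, None, None]       # made-up cache contents
assert fully_associative_lookup(tags, 16) == 2
assert fully_associative_lookup(tags, 5) is None   # took 8 comparisons to learn this
```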
Direct Mapping
Another, more popular way of organizing the cache is to use direct mapping. In a direct-mapped cache, each block frame can cache only a certain subset of the blocks in main memory.
[Figure 11-6: Fully associative mapping (RAM blocks 0–19; L1 cache with block frames 0–7)]
In Figure 11-7, each of the red blocks (blocks 0, 8, and 16) can be
cached only in the red block frame (frame 0). Likewise, blocks 1, 9, and 17
can be cached only in frame 1, blocks 2, 10, and 18 can be cached only in
frame 2, and so on. Hopefully, the pattern here is apparent: Each frame
caches every eighth block of main memory. As a result, the potential number
of locations for any one block is greatly narrowed, and therefore the number of
tags that must be checked on each fetch is greatly reduced. For example,
if the CPU needs a byte from blocks 0, 8, or 16, it knows that it only has to check the tag associated with frame 0 to determine if the desired block is in the
cache and to retrieve it if it is. This is much faster and more efficient than
checking every frame in the cache.
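Here is the same lookup sketched for a direct-mapped cache (again an illustration of mine, using the eight-frame cache from the figures): the frame index is derived directly from the block number, so a single tag comparison settles the question.

```python
# Sketch: direct-mapped lookup in an 8-frame cache. A block can live only
# in frame (block_number % 8), so one tag comparison decides hit or miss.

NUM_FRAMES = 8

def direct_mapped_lookup(tags, block_number):
    frame_index = block_number % NUM_FRAMES    # the only frame this block may use
    if tags[frame_index] == block_number:
        return frame_index                     # hit, after exactly one comparison
    return None                                # miss

tags = [8, 1, 2, 3, None, None, None, None]    # frame 0 currently holds block 8
assert direct_mapped_lookup(tags, 8) == 0      # blocks 0, 8, and 16 all map to frame 0
assert direct_mapped_lookup(tags, 0) is None   # block 0 would need frame 0, too
```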
There are some drawbacks to this scheme, though. For instance,
what if blocks 0 to 3 and 8 to 11 combine to form an eight-block working set
that the CPU wants to load into the cache and work on for a while? The
cache is eight blocks long, but since it’s direct-mapped, it can only store
four of these particular blocks at a time. Remember, blocks 0 and 8 have
to go in the same frame, as do blocks 1 and 9, 2 and 10, and 3 and 11.
As a result, the CPU can load only four blocks of this eight-block set at
a time, and swap them in and out as it works on different parts of the set.
If the CPU wants to work on this whole eight-block set for a long time,
that could mean a lot of swapping. Meanwhile, half of the cache is going
completely unused! While direct-mapped caches are almost always faster
than fully associative caches due to the shortened amount of time it
takes to locate a cached block, they can still be inefficient under some
circumstances.
[Figure 11-7: Direct mapping (RAM blocks 0–19; L1 cache with block frames 0–7)]
Note that the kind of situation described here, where the CPU would
like to store multiple blocks but it can’t because they all require the same
frame, is called a collision. In the preceding example, blocks 0 and 8 are said to collide, since they both want to fit into frame 0 but can't. Misses that result from such collisions are called conflict misses, the second of the three types of cache miss that I mentioned earlier.
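To see how costly such collisions can be, here is a small simulation (my own, with an invented access pattern) that replays the working-set example above: blocks 0 through 3 and 8 through 11 in an eight-frame direct-mapped cache.

```python
# Sketch: conflict misses in a direct-mapped cache. Blocks 0-3 and 8-11 all
# compete for frames 0-3, so ping-ponging between the two halves of the
# working set misses on every access while frames 4-7 sit idle.

NUM_FRAMES = 8
tags = [None] * NUM_FRAMES
misses = 0

accesses = [0, 8, 1, 9, 2, 10, 3, 11] * 4      # a made-up worst-case pattern
for block in accesses:
    frame = block % NUM_FRAMES
    if tags[frame] != block:                   # collision: evict and reload
        tags[frame] = block
        misses += 1

print(misses, "misses out of", len(accesses), "accesses")
print("unused frames:", tags.count(None))      # frames 4-7 were never touched
```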
N-Way Set Associative Mapping
One way to get some of the benefits of direct-mapped caches while lessening
the amount of cache space wasted due to collisions is to restrict the caching
of main memory blocks to a subset of the available cache frames. This technique is called set associative mapping, and a few popular implementations of it are described below.
Four-Way Set Associative Mapping
To see an example of what it means to restrict main memory blocks to a subset of the available cache frames, take a look at Figure 11-8, which illustrates four-way set associative mapping.
In Figure 11-8, any of the red blocks can go anywhere in the red set of
frames (set 0) and any of the light yellow blocks can go anywhere in the light
yellow set of frames (set 1). Think of the four-way associative cache like this: You took a fully associative cache and cut it in two, restricting half the main
memory blocks to one side and half the main memory blocks to the other.
[Figure 11-8: Four-way set associative mapping (RAM blocks 0–19; L1 cache divided into sets 0 and 1)]
This way, the odds of a collision are greatly reduced versus the direct-mapped
cache, but you still don’t have to search all the tags on every fetch like you
did with the fully associative cache. For any given fetch, you need search only
a single, four-block set to find the block you’re looking for, which in this case amounts to half the cache.
The cache pictured in Figure 11-8 is said to be four-way set associative because the cache is divided into sets of four frames each. Since this cache has only eight frames, it can accommodate only two four-frame sets. A larger
cache could accommodate more four-frame sets, reducing the odds of a
collision even more.
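A lookup in such a cache might be sketched like this (my own illustration, assuming the common choice of computing the set index as the block number modulo the number of sets): pick the set, then compare at most four tags.

```python
# Sketch: four-way set associative lookup. With two sets of four frames,
# the set index is block_number % 2, and at most four tags are compared
# no matter how many sets (and hence frames) the cache has.

def set_associative_lookup(sets, block_number):
    """sets[i] is a list of up to four tags (cached block numbers)."""
    set_index = block_number % len(sets)          # which set the block may live in
    for way, tag in enumerate(sets[set_index]):   # at most one set's worth of tags
        if tag == block_number:
            return set_index, way                 # hit
    return None                                   # miss

# Two sets of four frames each, as in Figure 11-8 (contents invented).
sets = [[0, 2, 4, 16], [1, 3, 5, 17]]
assert set_associative_lookup(sets, 16) == (0, 3)
assert set_associative_lookup(sets, 7) is None
```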
Figure 11-9 shows a four-way set associative cache like the one in
Figure 11-8, but with three sets instead of two. Notice that there are fewer
red main memory blocks competing for space in set 0, which means lower
odds of a collision.
In addition to its decreased odds of a collision, a four-way set associative
cache has a low access latency because the number of frames that must be
searched for a block is limited. Since all the sets consist of exactly four frames, no matter how large the cache gets, you’ll only ever have to search through
four frames (or one full set) to find any given block. This means that as the
cache gets larger and the number of sets that it can accommodate increases,
the tag searches become more efficient. Think about it. In a cache with three
sets, only one-third of the cache (or one set) needs to be searched for a given
block. In a cache with four sets, only one-fourth of the cache is searched.
In a cache with one hundred four-block sets, only one-hundredth of the
cache needs to be searched. The relative search efficiency scales with
the cache size.
[Figure 11-9: Four-way set associative mapping with three sets (sets 0–2; RAM blocks 0–19)]
Two-Way Set Associative Mapping
Another way to increase the number of sets in the cache without actually
increasing the cache size is to reduce the number of blocks in each set.
Check out Figure 11-10, which shows a two-way set associative cache.
[Figure 11-10: Two-way set associative mapping (RAM blocks 0–19; L1 cache divided into sets 0–3)]
To get a better idea of what’s going on, let’s compare the two-way associative cache to both the direct-mapped and the four-way cache. For the sake
of comparison, assume that the cache size and the memory size both stay
constant. And as you read the comparison, keep in mind that since each
increase in the level of associativity (i.e., from two-way to four-way, or from
four-way to eight-way) also increases the number of tags that must be checked
in order to locate a specific block, an increase in associativity also means an
increase in the cache’s latency.
Two-Way vs. Direct-Mapped
With the two-way cache, the number of potential collisions (and hence the
miss rate) is reduced compared to the direct-mapped scheme. However,
the number of tags that must be searched for each fetch is twice as high.
Depending on the relative sizes of the cache and main memory, this may or
may not increase the cache’s overall latency to the point that the decreased
conflict miss rate is worth it.
Two-Way vs. Four-Way
Though a two-way cache’s latency is less than that of a four-way cache, its
number of potential collisions (and hence its miss rate) is higher. Just as with the preceding comparison, how a two-way associative cache compares to a
four-way associative cache depends on just how much latency the four-way
cache’s increase in associativity ends up costing versus the decrease in
conflict miss rate.
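One way to get a feel for this trade-off is to simulate the same eight-frame cache at different associativities and count the misses. The sketch below is mine, not the book's: the access pattern is random, and replacement within a set is least-recently-used, a policy this chapter has not discussed.

```python
# Sketch: miss counts for one 8-frame cache organized as direct-mapped
# (1-way), two-way, and four-way set associative. LRU replacement within a
# set and a random access stream are assumptions made for illustration.

from collections import deque
import random

def count_misses(accesses, total_frames=8, ways=1):
    num_sets = total_frames // ways
    sets = [deque(maxlen=ways) for _ in range(num_sets)]   # each set holds `ways` tags
    misses = 0
    for block in accesses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)            # hit: refresh this block's LRU position
        else:
            misses += 1                # miss: appending below evicts the LRU tag
        s.append(block)
    return misses

random.seed(0)
accesses = [random.randrange(20) for _ in range(1000)]     # blocks 0-19, as in the figures
for ways in (1, 2, 4):
    print(f"{ways}-way: {count_misses(accesses, ways=ways)} misses")
```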
Associativity: Conclusions
In general, it turns out that when you factor in current technological
conditions (the speed of tag RAM, the range of sizes that most caches fall