into, and so on), some level of associativity less than or equal to eight-way turns out to be optimal for most caches. Any more than eight-way associativity, and the cache’s latency is increased so much that the decrease in miss rate isn’t worth it. Any less than two-way associativity, and the number of collisions often increases the miss rate to the point that the decrease in latency isn’t worth it. There are some direct-mapped caches out there, though.

Before I conclude the discussion of associativity, there are two minor bits of information that I should include for the sake of completeness. First, though you’ve probably already figured it out, a direct-mapped cache is simply a one-way set associative cache, and a fully associative cache is an n-way set associative cache, where n is equal to the total number of blocks in the cache.

Second, the cache placement formula, which enables you to compute the set in which an arbitrary block of memory should be stored in an n-way associative cache, is as follows:

(block_address) MOD (number_of_sets_in_cache)

I recommend trying out this formula on some of the preceding examples. It might seem like a boring and pointless exercise, but if you take five minutes and place a few blocks using this simple formula in conjunction with the preceding diagrams, everything I’ve said in this section will really fall into place for you in a “big picture” kind of way. And it’s actually more fun to do than it probably sounds.
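If you’d rather let a computer do the placing for you, here is a minimal sketch in C of the same formula. The cache geometry (64-byte blocks and 512 sets) is an assumed example rather than one of the configurations from the preceding diagrams:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed example geometry: 64-byte blocks, 512 sets. Illustrative only. */
    #define BLOCK_SIZE      64
    #define NUMBER_OF_SETS  512

    /* Apply the placement formula: (block_address) MOD (number_of_sets_in_cache). */
    static unsigned set_index(uint64_t byte_address)
    {
        uint64_t block_address = byte_address / BLOCK_SIZE; /* which block holds this byte */
        return (unsigned)(block_address % NUMBER_OF_SETS);  /* which set that block maps to */
    }

    int main(void)
    {
        /* Block addresses that differ by a multiple of the set count collide
           in the same set, no matter how far apart they are in memory. */
        printf("0x00010000 -> set %u\n", set_index(0x00010000));
        printf("0x00090000 -> set %u\n", set_index(0x00090000));
        return 0;
    }

Both sample addresses land in set 0, which is exactly the kind of collision the associativity discussion above is concerned with.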

Temporal and Spatial Locality Revisited: Replacement/Eviction Policies and Block Sizes

Caches can increase the amount of benefit they derive from temporal locality by implementing an intelligent replacement policy (also called, conversely, an eviction policy). A replacement policy dictates which of the blocks currently in the cache will be replaced by any new block that gets fetched in. (Or, another way of putting it is that an eviction policy dictates which of the blocks currently in the cache will be evicted in order to make room for any new blocks that are fetched in.)

Types of Replacement/Eviction Policies

One way to implement a replacement policy would be to pick a block at random to be replaced. Other possible replacement policies would be a FIFO policy, a LIFO policy, or some other such variation. However, none of these policies take into account the fact that any block that was recently used is likely to be used again soon. With these simple policies, you wind up evicting blocks that will be used again shortly, thereby increasing the cache miss rate and eating up valuable memory bus bandwidth with a bunch of unnecessary fetches.

The ideal replacement policy would be one that makes room for an incoming block by evicting the cache block that is destined to go unused for the longest period of time. Unfortunately, computer architects haven’t yet devised a way of infallibly predicting the future, so there’s no way to know for sure which of the blocks currently residing in the cache is the one that will go the longest time before being accessed again.

Even if you can’t predict the future, you can make an educated guess based on what you know of the cache’s past behavior. The optimal replacement policy that doesn’t involve predicting the future is to evict the block that has gone the longest period of time without being used, or the least recently used (LRU) block. The logic here is that if a block hasn’t been used in a while, it’s less likely to be part of the current working set than a block that was more recently used.

An LRU algorithm, though ideal, isn’t quite feasible to implement in real life. Control logic that checks each cache block to determine which one is the least recently used not only would add complexity to the cache design, but such a check would also take up valuable time and thereby add unwanted latency to each replacement. Most caches wind up implementing some type of pseudo-LRU algorithm that approximates true LRU by marking blocks as more and more dirty the longer they sit unused in the cache. When a new block is fetched into the cache, the dirtiest blocks are the first to be replaced.
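To make the bookkeeping concrete, here is a rough sketch in C of an age-based LRU approximation for a single four-way set. It is not the hardware scheme the text describes, just an illustration of the idea: every valid block’s age counter grows on each access to the set, the touched block resets to zero, and the block with the highest age is the eviction victim.

    #include <stdint.h>

    #define WAYS 4  /* assumed four-way set, for illustration */

    struct way {
        uint64_t tag;
        int      valid;
        unsigned age;   /* accesses since this block was last touched */
    };

    /* Called on every access to the set: the touched way becomes the freshest,
       and every other valid way grows one step older. */
    static void touch(struct way set[WAYS], int hit_way)
    {
        for (int i = 0; i < WAYS; i++)
            if (set[i].valid)
                set[i].age++;
        set[hit_way].age = 0;
    }

    /* Pick a victim for an incoming block: an empty way if one exists,
       otherwise the oldest (least recently used) block in the set. */
    static int pick_victim(const struct way set[WAYS])
    {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid)
                return i;                       /* free slot, nothing to evict */
            if (set[i].age > set[victim].age)
                victim = i;
        }
        return victim;
    }

Real hardware avoids keeping full per-block counters like these; pseudo-LRU schemes get by with a handful of status bits per set, at the cost of only approximating the true least-recently-used choice.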

Sometimes, blocks that aren’t all that dirty get replaced by newer blocks, just because there isn’t enough room in the cache to contain the entire working set. A miss that results when a block containing needed data has been evicted from the cache due to a lack of cache capacity is called a capacity miss. This is the third and final type of cache miss.

Block Sizes

In the section on spatial locality, I mentioned that storing whole blocks is one way that caches take advantage of spatial locality of reference. Now that you know a little more about how caches are organized internally, you can take a closer look at the issue of block size.

You might think that as cache sizes increase, you can take even better advantage of spatial locality by making block sizes even bigger. Surely fetching more bytes per block into the cache would decrease the odds that some part of the working set will be evicted because it resides in a different block. This is true to some extent, but you have to be careful. If you increase the block size while keeping the cache size the same, you decrease the number of blocks that the cache can hold. Fewer blocks in the cache means fewer sets, and fewer sets means that collisions and therefore misses are more likely. And of course, with fewer blocks in the cache, the likelihood decreases that any particular block that the CPU needs will be available in the cache.
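To put rough numbers on that trade-off, here is a short C sketch. The cache dimensions (32KB total, four-way set associative) are assumed for illustration; the point is only that, with total size and associativity fixed, the set count is cache_size / (block_size × ways), so every doubling of the block size halves the number of sets.

    #include <stdio.h>

    /* Assumed example: a 32KB, four-way set associative cache. */
    #define CACHE_SIZE  (32 * 1024)
    #define WAYS        4

    int main(void)
    {
        /* Doubling the block size with the total size fixed halves the number
           of blocks and of sets, so more memory blocks contend for each set. */
        for (int block_size = 32; block_size <= 256; block_size *= 2) {
            int blocks = CACHE_SIZE / block_size;
            int sets   = blocks / WAYS;
            printf("block size %3d bytes -> %4d blocks, %3d sets\n",
                   block_size, blocks, sets);
        }
        return 0;
    }

Going from 32-byte to 256-byte blocks in this example drops the cache from 256 sets to 32, which means eight times as many memory blocks competing for each set.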

The upshot of all this is that smaller block sizes allow you to exercise more fine-grained control of the cache. You can trace out the boundaries of a working set with a higher resolution by using smaller cache blocks. If your cache blocks are too large, you wind up with a lot of wasted cache space, because many of the blocks will contain only a few bytes from the working set while the rest is irrelevant data. If you think of this issue in terms of cache pollution, you can say that large cache blocks are more prone to pollute the cache with non-reusable data than small cache blocks.

Figure 11-11 shows the memory map we’ve been using, sitting in a cache with large block sizes.

Figure 11-11: Memory map with large block sizes

Figure 11-12 shows the same map, but with the block sizes decreased. Notice how much more control the smaller blocks allow over cache pollution. The smaller cache blocks have a higher average ratio of red to blue in each block, which means that it’s easier to keep the precise contours of the working set in the cache.

Figure 11-12: Memory map with small block sizes

The other problems with large block sizes are bandwidth related. The larger the block size, the more data is fetched with each load, so large block sizes can really eat up memory bus bandwidth, especially if the miss rate is high. A system therefore needs plenty of bandwidth if it’s going to make good use of large cache blocks. Otherwise, the increase in bus traffic can increase the amount of time it takes to fetch a cache block from memory, thereby adding latency to the cache.

Write Policies: Write-Through vs. Write-Back

So far, this entire chapter has dealt with only one type of memory traffic: loads, or requests for data from memory. I’ve talked only about loads because they make up the vast majority of memory traffic. The remainder of memory traffic is made up of stores, which in simple uniprocessor systems are much easier to deal with. In this section, I’ll explain how to handle stores in single-processor systems with just an L1. When you throw in more caches and multiple processors, things get more complicated than I want to go into here.

Once a retrieved piece of data is modified by the CPU, it must be stored or written back out to main memory so that the rest of the system has access to the most up-to-date version of it.

There are two ways to deal with such writes to memory. The first way is to immediately update all the copies of the modified data in each level of the cache hierarchy to reflect the latest changes. A piece of modified data would be written to both the L1 and main memory so that all of its copies are current. Such a policy for handling writes is called write-through, since it writes the modified data through to all levels of the hierarchy.

A write-through policy can be nice for multiprocessor and I/O–intensive system designs, since multiple clients are reading from memory at once and all need the most current data available. However, the multiple updates per write required by this policy can greatly increase memory traffic. For each store, the system must update multiple copies of the modified data. If there’s a large amount of data that has been modified, this increased write activity could eat up quite a bit of memory bandwidth that could be used for the more important load traffic.

The alternative to write-through is write-back, and it can potentially result in less memory traffic. With a write-back policy, changes propagate down to the lower levels of the hierarchy as cache blocks are evicted from the higher levels. An updated piece of data in an L1 cache block will not be updated in main memory until that block is evicted from the L1. The trade-off for write-back is that it’s more complex to keep the multiple copies of the data in sync, especially in a multiprocessor system.
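Here is a minimal sketch in C of the two store paths, assuming a single L1 backed by a small stand-in main_memory array; the structure and function names are invented for illustration. Write-through pushes every store out to memory immediately, while write-back just sets a dirty bit and defers the memory update until the block is evicted.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 64

    struct cache_block {
        uint64_t tag;
        int      valid;
        int      dirty;                   /* only meaningful under write-back */
        uint8_t  data[BLOCK_SIZE];
    };

    /* Small stand-in for main memory; addresses in this sketch are assumed
       to fall inside it. */
    static uint8_t main_memory[1 << 20];

    /* Write-through: update the cached copy and main memory on every store. */
    static void store_write_through(struct cache_block *blk, uint64_t addr, uint8_t value)
    {
        blk->data[addr % BLOCK_SIZE] = value;
        main_memory[addr] = value;        /* every store generates memory traffic */
    }

    /* Write-back: update only the cached copy and mark it dirty; main memory
       doesn't see the change until the block is evicted. */
    static void store_write_back(struct cache_block *blk, uint64_t addr, uint8_t value)
    {
        blk->data[addr % BLOCK_SIZE] = value;
        blk->dirty = 1;
    }

    /* On eviction, a dirty block must be written back before it is replaced. */
    static void evict(struct cache_block *blk, uint64_t block_base_addr)
    {
        if (blk->valid && blk->dirty)
            memcpy(&main_memory[block_base_addr], blk->data, BLOCK_SIZE);
        blk->valid = 0;
        blk->dirty = 0;
    }

If a block is stored to many times before it is evicted, the write-back path collapses all of those memory writes into a single one, which is where the traffic savings come from.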

Conclusions

There is much, much more that can be said about caching; this chapter has covered only the basic concepts. As main memory moves farther away from the CPU in terms of CPU clock cycles, the importance of caching will only increase. For modern microprocessor systems, larger on-die caches have turned out to be one of the simplest and most effective uses for the increased transistor budgets that new manufacturing process technologies afford.

INTEL’S PENTIUM M, CORE DUO, AND CORE 2 DUO

In March 2003, Intel officially launched the Pentium M, a new x86 processor designed specifically for the mobile computing market. The Pentium M was a success from both performance and power efficiency perspectives, with the result that the immediate successor to the Pentium M, called Core Duo, made its way out of the portable niche and into desktop computers.

Core Duo’s combination of high performance and low power consumption made it clearly superior to processors based on Intel’s aging Netburst microarchitecture, and in the fall of 2005, Intel announced plans to unify its entire x86 product line, from portables all the way up to servers, on the follow-up to Core Duo: a brand new, power-efficient, 64-bit microarchitecture called Core. The desktop implementation of Core is the new Core 2 Duo processor line from Intel, while server products based on Core are sold under the venerable Xeon brand.

This chapter covers the major design aspects of the Pentium M, Core Duo, and Core 2 Duo, comparing the microarchitectures of these three related processors with the Intel processors covered in previous chapters. I’ll talk about the major new features common to all three designs, like micro-ops fusion and the improved branch predictor, as well as each processor’s individual innovations. Finally, I’ll discuss memory instruction reordering, and I’ll explain how a feature called memory disambiguation enables Core 2 to perform a kind of speculative execution on a processor’s outgoing results stream.

Code Names and Brand Names

As far back as the launch of the Pentium 4 line, Intel began emphasizing the difference between microarchitecture and implementation. This difference, which is discussed in the section “Processor Organization and Core Microarchitecture” on page 247, has taken on new importance with the introduction of multi-core processors like the Core Duo. A particular processor from Intel might have three different names associated with it: a brand name for the processor itself (e.g., Pentium 4), a separate brand name for the microarchitecture that the processor implements (e.g., Netburst), and a commonly used code name that refers to a particular variant (e.g., Prescott) within a larger family of closely related microarchitectures that share a brand name. (Willamette, Northwood, Prescott, and Cedar Mill are all variants of Intel’s Netburst microarchitecture that are used in different Pentium 4–branded processors.) The reader (and the author of computer books) is therefore forced to juggle three different commonly used names when referring to any
