
and hard disk’s page file. Each level in the hierarchy subsets the level below it. We’ll talk later about how all of those copies are kept in sync.

In Figure 11-3, the red cells are the code and data for the program that the CPU is currently running. The blue cells are unrelated to the currently running program. This figure, which will become even clearer once you read “Locality of Reference” on page 220, shows how each level of the cache hierarchy subsets the level below it.

As Table 11-1 indicates, each level of the hierarchy depicted in Figure 11-3 is controlled by a different part of the system. Data is promoted up the hierarchy or demoted down the hierarchy based on a number of different criteria; in the remainder of this chapter we’ll concern ourselves only with the top levels of the hierarchy.

[Figure 11-3: Code and data in the cache hierarchy. The hierarchy runs from the CPU’s registers through the L1 and L2 caches out to main memory (RAM).]

Example: A Byte’s Brief Journey Through the Memory Hierarchy

For the sake of example, let’s say the CPU issues a load instruction that tells the memory subsystem to load a piece of data (in this case, a single byte) into one of its registers. First, the request goes out to the L1, which is checked to see if it contains the requested data. If the L1 does not contain the data and therefore cannot fulfill the request (in other words, a cache miss occurs), the request propagates down to the L2. If the L2 does not contain the desired byte, the request begins the relatively long trip out to main memory. If main memory doesn’t contain the data, you’re in big trouble, because then it has to be paged in from the hard disk, an act that can take a relative eternity in CPU time.

Let’s assume that the requested byte is found in main memory. Once located, the byte is copied from main memory, along with a bunch of its neighboring bytes in the form of a cache block or cache line, into the L2 and L1. When the CPU requests this same byte again, it will be waiting for it there in the L1, a situation called a cache hit.
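To make that walk down the hierarchy concrete, here is a toy, software-only model of it in C. Everything about it is invented for illustration: the line size, the tiny direct-mapped caches, and the refill path straight from RAM. Real hardware performs these checks with parallel tag-comparison logic, not sequential code.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE      16    /* bytes per cache line (toy value) */
    #define L1_LINES   4    /* deliberately tiny so misses are easy to see */
    #define L2_LINES  16

    static uint8_t ram[1 << 16];    /* "main memory" */

    typedef struct { int valid; uint32_t tag; uint8_t data[LINE]; } line_t;
    static line_t l1[L1_LINES], l2[L2_LINES];

    /* Direct-mapped lookup: each line can live in exactly one slot. On a
     * miss, the whole line is refilled (straight from RAM here, a toy
     * simplification of the real fill path between levels). */
    static uint8_t lookup(line_t *cache, int nlines, uint32_t addr, int *hit)
    {
        uint32_t tag = addr / LINE;
        line_t *ln = &cache[tag % nlines];
        *hit = ln->valid && ln->tag == tag;
        if (!*hit) {
            ln->valid = 1;
            ln->tag = tag;
            memcpy(ln->data, &ram[tag * LINE], LINE);
        }
        return ln->data[addr % LINE];
    }

    /* Walk the hierarchy exactly as described above: L1, then L2, then
     * the long trip out to main memory. */
    uint8_t load_byte(uint32_t addr)
    {
        int hit;
        uint8_t v = lookup(l1, L1_LINES, addr, &hit);
        if (hit) { puts("L1 hit"); return v; }
        v = lookup(l2, L2_LINES, addr, &hit);
        puts(hit ? "L1 miss, L2 hit" : "miss in both caches: fetch from RAM");
        return v;
    }

    int main(void)
    {
        load_byte(0x1234);   /* first touch: misses everywhere */
        load_byte(0x1234);   /* same byte again: L1 hit */
        load_byte(0x1235);   /* a neighbor in the same line: also an L1 hit */
        return 0;
    }

Note that the third access hits even though that particular byte was never requested before; it arrived as a side effect of the earlier line fill, which is the whole point of fetching data in blocks.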

Cache Misses

Computer architects usually divide cache misses up into three different types depending on the situation that brought about the miss. I’ll introduce these three types of misses at appropriate points over the course of the chapter, but I can talk about the first one right now.

A compulsory miss is a cache miss that occurs because the desired data was never in the cache and therefore must be loaded in for the first time in a program’s execution. It’s called a compulsory miss because, barring the use of certain specialized tricks like data prefetching, it’s the one type of miss that just can’t be avoided. All cached data must be brought into the cache for the very first time at some point, and the occasion for doing so is normally a compulsory miss.
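Compulsory misses are easy to identify in an address trace: a reference is a compulsory miss if and only if its cache line has never appeared in the trace before, no matter how large the cache is or what its replacement policy does. A small C sketch of that bookkeeping, with a made-up trace and an assumed 64-byte line size:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE 64    /* assumed cache line size in bytes */

    int main(void)
    {
        uint32_t trace[] = { 0x1000, 0x1008, 0x2000, 0x1010, 0x2038 };
        int ntrace = sizeof trace / sizeof trace[0];
        uint32_t seen[64];
        int nseen = 0, compulsory = 0;

        for (int i = 0; i < ntrace; i++) {
            uint32_t line = trace[i] / LINE;   /* which cache line? */
            int found = 0;
            for (int j = 0; j < nseen; j++)
                if (seen[j] == line) found = 1;
            if (!found) {                      /* first touch of this line */
                seen[nseen++] = line;
                compulsory++;
            }
        }
        /* The five references touch only two distinct lines, so this
         * prints "2 of 5 references were compulsory misses". */
        printf("%d of %d references were compulsory misses\n",
               compulsory, ntrace);
        return 0;
    }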

The two other types of cache misses are misses that result when the CPU requests data that was previously in the cache but has been evicted for some reason or other. We’ll discuss evictions in “Temporal and Spatial Locality Revisited: Replacement/Eviction Policies and Block Sizes” on page 230, and I’ll cover the other two types of cache misses as they come up throughout the course of this chapter.


Locality of Reference

Caching works because of a very simple property exhibited to one degree or another by all types of code and data: locality of reference. We generally find it useful to talk about two types of locality of reference: spatial locality and temporal locality.

Spatial locality

Spatial locality is a fancy label for the general rule that if the CPU needs an item from memory at any given moment, it’s likely to need that item’s neighbors next.

Temporal locality

Temporal locality is the name we give to the general rule that if an item in memory was accessed once, it’s likely to be accessed again in the near future.

Depending on the type of application, both code and data streams can exhibit spatial and temporal locality.
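The difference between the two kinds of access pattern is easy to see in code. In the C sketch below (the array size is arbitrary), both loops read exactly the same elements, but the first walks memory in address order while the second keeps striding off to distant addresses:

    #include <stdio.h>

    #define N 1024

    static double a[N][N];   /* C stores each row contiguously in memory */

    int main(void)
    {
        double sum = 0.0;

        /* Good spatial locality: consecutive iterations touch consecutive
         * addresses, so every byte of each fetched cache line gets used. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor spatial locality: consecutive iterations are N * 8 bytes
         * apart, so each cache line contributes only one element before
         * the traversal moves far away. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }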

Spatial Locality of Data

Spatial locality of data is the easiest type of locality to understand, because most of you have used media applications like MP3 players, DVD players, and other types of applications whose datasets consist of large, ordered files. Consider an MP3 file, which has a bunch of blocks of data that are consumed by the processor in sequence from the file’s beginning to its end. If the CPU is running iTunes and it has just requested second 1:23 of a five-minute MP3, you can be reasonably certain that next it’s going to want seconds 1:24, 1:25, and so on. The same is true of a DVD file, and of many other types of media files like images, AutoCAD drawings, and 3D game levels. All of these applications operate on large arrays of sequentially ordered data that get ground through in sequence by the CPU again and again.

Business applications like word processors also have great spatial locality for data. If you think about it, few people open six or seven documents in a word processor and quickly alternate between them, typing one or two words in each. Most of us just open up one or two relatively modest-sized files and work in them for a while without skipping around too much within the same file. These files are stored in contiguous regions of memory, where they can be brought quickly into the cache in a few large batches.

Ultimately, spatial locality is just a way of saying that related chunks of data tend to clump together in memory, and since they’re related, they also tend to be processed together in batches by the CPU.

In Figure 11-4, as in Figure 11-3, the red cells are related chunks of data stored in the memory array. This figure shows a program with fairly good spatial locality, since the red cells are clumped closely together. In an application with poor spatial locality, the red cells would be more randomly distributed among the unrelated blue cells.

[Figure 11-4: Spatial locality]


Spatial Locality of Code

Spatial locality applies to code just like it does to data: most well-written code tries to avoid jumps and branches so that the processor can execute through large contiguous blocks uninterrupted. Games, simulations, and media processing applications tend to have decent spatial locality for code, because such applications often feature small blocks of code (called kernels) operating repeatedly on very large datasets.


When it comes to spatial locality of code for business applications, the picture is mixed. If you think about the way that you use a word processor, it’s easy to see what’s going on. As you create a document, most of you are constantly pressing different formatting buttons and invoking different menu options. For instance, you might format one word in italics, change the paragraph spacing, and then save the file, all in sequence. Each of these actions invokes a very different part of the code in a large application like Microsoft Word; it’s not likely that the code for the FileSave menu option is stored right next to the code that formats a font in italics. The way you use a word processor forces the CPU to jump around from place to place in memory in order to retrieve the correct code. However, the segment of the code stream that implements each individual action (i.e., saving a file, formatting a font, and so on) usually exists in memory as a fairly large, spatially localized chunk, very much like a little subprogram within the larger application. While the code for the FileSave menu action might be quite far away from the code for the italics formatting option, both of these chunks of code exhibit good spatial locality as small programs in their own right.

What this means for a cache designer is that business applications need large instruction caches to be able to collect all of the most frequently used clumps of code that correspond to the most frequently executed actions and pack them together in the cache. If the instruction cache is too small, the different clumps get swapped out as different actions are performed. If the instruction cache is large enough, many of these subprograms will fit and there’s little swapping needed. Incidentally, this is why business applications performed so poorly on Intel’s original cacheless Celeron processor.

Temporal Locality of Code and Data

Consider a simple Photoshop filter that inverts an image to produce a negative; there’s a small piece of code that performs the same inversion on each pixel, starting at one corner and going in sequence all the way across and down to the opposite corner. This code is just a small loop that gets executed repeatedly, once on each pixel, so it’s an example of code that is reused again and again. Media applications, games, and simulations, because they use lots of small loops that iterate through very large datasets, have excellent temporal locality for code.
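Such a filter really is only a few lines of code. A minimal C sketch, assuming a simple 8-bit grayscale image stored as one row-major buffer (the image format is an assumption for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Invert an 8-bit grayscale image to produce a negative. The handful
     * of instructions in the loop body are executed once per pixel, so
     * the code stays hot in the cache (excellent temporal locality),
     * while each pixel is touched exactly once (poor temporal locality
     * for the data). */
    void invert(uint8_t *pixels, size_t width, size_t height)
    {
        for (size_t i = 0; i < width * height; i++)
            pixels[i] = 255 - pixels[i];
    }

    int main(void)
    {
        uint8_t image[4] = { 0, 64, 128, 255 };
        invert(image, 2, 2);
        printf("%d %d %d %d\n", image[0], image[1], image[2], image[3]);
        return 0;
    }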

The same large, homogeneous datasets that give media applications and the like good temporal locality for code also give them extremely poor temporal locality for data. Returning to the MP3 example, a music file is usually played through once in sequence and none of its parts are repeated. This being the case, it’s actually a waste to store any of that file in the data cache, since it’s only going to stop off there temporarily before passing through to the CPU.

When an application like the aforementioned MP3 player fills up the cache with data that doesn’t really need to be cached, because it won’t be used again, and as a result winds up bumping data that will be reused out of the cache, that application is said to “pollute” the cache. Media applications, games, and the like are big cache polluters, which is why they weren’t too affected by the original Celeron’s lack of cache. Because these applications’ data wasn’t going to be needed again any time soon, the fact that it wasn’t in a readily accessible cache didn’t really matter.
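It’s worth noting that x86 processors with SSE2 give software a way to sidestep this kind of pollution: non-temporal store instructions that move data out to memory while bypassing the caches. A minimal sketch using the _mm_stream_si128 intrinsic; the function itself and its alignment assumptions are invented for illustration:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* Copy a buffer of streaming data without polluting the cache.
     * _mm_stream_si128 performs a non-temporal store: the data goes to
     * memory without displacing reusable lines from the cache hierarchy.
     * Assumes dst is 16-byte aligned and n is a multiple of 16. */
    void stream_copy(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            _mm_stream_si128((__m128i *)(dst + i), v);
        }
        _mm_sfence();   /* make the streamed stores globally visible */
    }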

The primary way in which caches take advantage of temporal locality is probably obvious by this point: caches provide a place to store code and data that the CPU is currently working with. By working with, I mean that the CPU has used it once and is likely to use it again. A group of related blocks of code and/or data that the CPU uses and reuses over a period of time in order to complete a task is called a working set. Depending on the size of a task’s working set and the number of operations it takes the CPU to complete that task, spatial and temporal locality, and with them the cache hierarchy, will afford a greater or lesser performance increase on that task.

Locality: Conclusions

One point that should be apparent from the preceding discussion is that temporal locality implies spatial locality, but not vice versa. That is to say, data that is reused is always related (and therefore collects into spatially localized clumps in memory), but data that is related is not always reused. An open text file is an example of reusable data that occupies a localized region of memory, and an MP3 file is an example of non-reusable (or streaming) data that also occupies a localized region of memory.


The other thing that you probably noticed from this section is the fact that memory access patterns for code and memory access patterns for data are often very different within the same application. For instance, media applications have excellent temporal locality for code but poor temporal locality for data. This fact has inspired many cache designers to split the L1 into two regions, one for code and one for data. The code half of the cache is called the instruction cache, or I-cache, while the data half of the cache is called the data cache, or D-cache. This partitioning can result in significant performance gains, depending on the size of the cache, the types of applications normally run on the system, and a variety of other factors.

Cache Organization: Blocks and Block Frames

One way that caches take advantage of locality of reference is by loading data from memory in large chunks. When the CPU requests a particular piece of data from the memory subsystem, that piece gets fetched and loaded into the L1 along with some of its nearest neighbors. The actual piece of data that was requested is called the critical word, and the surrounding group of bytes that gets fetched along with it is called a cache line or cache block. By fetching not only the critical word but also a group of its neighbors and loading them into the cache, the CPU is prepared to take advantage of the spatial locality of code and data streams.