Authors: Jon Stokes
they’re committed together as a single fused micro-op.
The two types of instructions that can be translated into fused micro-ops
are the store instruction type and the load-op instruction type.
Fused Stores
Store instructions on x86 processors, including not only Intel processors but also those from AMD, are broken down during the decode phase into two micro-ops: a store-address micro-op and a store-data micro-op. The store-address micro-op is the instruction that tells the address generation hardware to calculate the address in memory where the data is to be stored. This micro-op is sent to the store-address execution unit in the back end's load-store unit (LSU) for execution. The store-data micro-op is the instruction that writes the data to be stored into the outgoing store-data buffer. From the store-data buffer, the data will be written out to memory when the store instruction commits; this micro-op is executed by the store-data execution unit, which is also located in the LSU.
242
Chapter 12
The Pentium M's instruction decoding hardware decodes the store operation into two separate micro-ops, but it then fuses these two micro-ops together before writing them to a single, shared entry in the micro-op queue.
As noted earlier, the instructions remain fused until they’re issued through
an issue port to the actual store unit, at which point they’re treated separately by the back end. Because the store-address and store-data operations are
inherently parallel and are performed by two separate execution units on
two separate issue ports, these two micro-ops can issue and execute in
parallel—the data can be written to the store buffer at the same time that
the store address is being calculated. When both micro-ops have completed
execution, the core’s commitment unit again treats them as if they are fused.
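The lifecycle just described can be modeled in a few lines of Python. This is only a hypothetical sketch for illustration; the class and method names are invented, not Intel's:

```python
from dataclasses import dataclass

@dataclass
class FusedStore:
    """One shared queue/ROB entry tracking a store's two micro-ops."""
    addr_done: bool = False  # store-address micro-op has executed
    data_done: bool = False  # store-data micro-op has executed

    def issue(self):
        # The two halves leave through separate issue ports, so they
        # may issue and execute in parallel, in either order.
        return ["store-address", "store-data"]

    def complete(self, micro_op):
        if micro_op == "store-address":
            self.addr_done = True
        elif micro_op == "store-data":
            self.data_done = True

    def ready_to_commit(self):
        # The commitment unit treats the pair as fused again: the single
        # entry retires only once both halves have executed.
        return self.addr_done and self.data_done

store = FusedStore()
assert not store.ready_to_commit()
for micro_op in store.issue():   # either order would work here
    store.complete(micro_op)
assert store.ready_to_commit()
```

The point of the sketch is the asymmetry between tracking and execution: one entry in the window, two independent executions, one commit.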
Fused Loads
A load-op, or read-modify, instruction is exactly what it sounds like: a two-part instruction that loads data from memory into a register and then performs an operation on that data. Such instructions are broken down by the decoding hardware into two micro-ops: a load micro-op that's issued to the load execution unit and is responsible for calculating the source address and then loading the needed data, and a second micro-op that performs some type of operation on the loaded data and is executed by the appropriate execution unit.
Load-op instructions are treated in much the same way as store instructions with respect to decoding, fusion, and execution. The load micro-op is fused with the second micro-op, and the two are tracked as a single micro-op in the processor's instruction window. As with the fused stores described earlier, the two constituent parts of the fused micro-op issue and execute separately before being committed together.
Note that unlike the store-address and store-data micro-ops that make up
a fused store, the two parts of a fused load-op instruction are inherently serial, because the load operation must be executed first. Thus the two parts of the
fused load-op must be executed in sequence.
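The serial constraint can be added to the same kind of toy model. Again, this is a hypothetical sketch with invented names, shown only to contrast the ordering rules of the two fusion types:

```python
from dataclasses import dataclass

@dataclass
class FusedLoadOp:
    """One tracking entry whose two halves must execute in order."""
    load_done: bool = False
    op_done: bool = False

    def issuable(self):
        # Serial, unlike a fused store: the ALU half may not issue
        # until the load half has produced its data.
        if not self.load_done:
            return ["load"]
        if not self.op_done:
            return ["op"]
        return []

    def complete(self, micro_op):
        if micro_op == "load":
            self.load_done = True
        elif micro_op == "op":
            assert self.load_done, "op cannot execute before its load"
            self.op_done = True

add_from_memory = FusedLoadOp()
assert add_from_memory.issuable() == ["load"]  # only the load may go first
add_from_memory.complete("load")
assert add_from_memory.issuable() == ["op"]    # now the ALU half is eligible
add_from_memory.complete("op")
assert add_from_memory.issuable() == []        # both halves done; entry can commit
```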
The Impact of Micro-ops Fusion
Intel claims that micro-ops fusion on the Pentium M reduces the number of
micro-ops in the instruction window by over 10 percent. Fewer micro-ops in flight means that a larger number of instructions can be tracked with the same
number of ROB and RS entries. Thus the Pentium M’s re-order, issue, and
commit width is effectively larger than the number of ROB and RS entries
alone would suggest. The end result is that, compared to its predecessors, the Pentium M can extract more performance from the same amount of instruction-tracking hardware, a feature that gives the processor more performance per watt of power dissipated.
As you might expect, the Pentium M sees the most benefit from micro-ops fusion during sustained bursts of memory operations. During long
stretches of memory traffic, all three of the Pentium M’s decoders are able
to work in parallel to process incoming memory instructions, thereby tripling
the decode bandwidth of the older P6 core. Intel estimates that this improved
memory instruction decode bandwidth translates into a 5 percent performance boost for integer code and a 9 percent boost for floating-point code,
the former benefiting more from store fusion and the latter benefiting
equally from store and load-op fusion.
NOTE
You may have noticed that the Pentium M’s micro-ops fusion feature works remarkably
like the PowerPC 970’s instruction grouping scheme, insofar as both processors bind
translated instructions together in the decode phase before dispatching them as a group
to the back end. The analogy between the Pentium M’s micro-ops fusion feature and the
970’s group dispatch feature isn’t perfect, but it is striking, and both features have a
similarly positive effect on performance and power consumption.
Branch Prediction
As you learned in “Caching Basics” on page 215,
the ever-growing distance (in CPU cycles) between main memory and the CPU means that precious
transistor resources spent on branch prediction hardware continue to give
an ever larger return on investment. For reasons of both performance and
power efficiency, Intel spent quite a few transistors on the Pentium M’s
branch predictor.
The Pentium M’s branch predictor is one place where the newer design
borrows from the Pentium 4 instead of the P6. The Pentium M adopts the
Pentium 4’s already powerful branch prediction scheme and expands on it
by adding two new features: a loop detector and an indirect predictor.
The Loop Detector
One of the most common types of branches that a processor encounters is the exit condition of a loop. In fact, loops are so common that the static method of branch prediction, in which all branches are assumed to be loop exit conditions that evaluate to taken, works reasonably well for processors with shallow pipelines.
One problem with static branch predictors is that they always make a wrong prediction on the final iteration of the loop—the iteration on which the branch evaluates to not taken—thereby forcing a pipeline stall as the processor recovers from the erroneous prediction. The other, more important problem with static prediction is that it works poorly for non-loop branches, like standard if-then conditionals. In such branches, a static prediction of taken is roughly the equivalent of a coin toss.
Dynamic predictors, like the Pentium 4’s branch predictor, fix this
shortcoming by keeping track of the execution history of a particular branch
instruction in order to give the processor a better idea of what its outcome
on the current pass will probably be. The bigger the table used to track the
branch’s history, the more data the branch predictor has to work with and
the more accurate its predictions can be. However, even a relatively sizable
branch history table (BHT) like that of the Pentium 4 doesn’t have enough
space to store all the relevant execution history information on the loop
branches, since they tend to take a very large number of iterations. Therefore
loops that go through many iterations will always be mispredicted by a
standard dynamic branch predictor.
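To see why, consider a simplified sketch of an ordinary two-bit saturating-counter predictor watching a single loop branch (a hypothetical model, not the Pentium 4's actual table organization). No matter how regular the loop is, the exit iteration is mispredicted on every pass:

```python
def count_mispredicts(trips, runs):
    """Two-bit saturating counter watching one loop branch.

    The branch is taken (trips - 1) times, then not taken once at the
    loop exit; the whole loop is entered `runs` times.
    """
    state = 3  # counter: 0-1 predict not-taken, 2-3 predict taken
    mispredicts = 0
    for _ in range(runs):
        outcomes = [True] * (trips - 1) + [False]  # False = loop exit
        for taken in outcomes:
            if (state >= 2) != taken:
                mispredicts += 1
            # Saturating update toward the actual outcome.
            state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredicts

# One guaranteed mispredict per pass through the loop: the exit.
assert count_mispredicts(trips=100, runs=10) == 10
```

The counter saturates at "taken" during the loop body, so the exit always surprises it, once per pass, forever.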
The Pentium M's loop detector addresses this problem by analyzing branches as they execute, in order to identify which branches are loop exit conditions. For each branch that the detector thinks is a loop exit condition, the branch prediction hardware initializes a special set of counters in the predictor table to keep track of how many times the loop actually iterates. If loops with even fairly large numbers of iterations always tend to iterate the same number of times, then the Pentium M's branch predictor can predict their behavior with 100 percent accuracy.
In sum, the loop detector plugs into the Pentium M’s P4-style branch
predictor and augments it by providing it with extra, more specialized data
on the loops of the currently executing program.
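A toy model of that idea follows. This is purely illustrative; the real hardware's counters and tables are undocumented, and all names here are invented. The detector records a loop's trip count the first time through and then predicts the exit iteration exactly:

```python
class LoopDetector:
    """Learns a loop branch's trip count, then predicts its exit."""

    def __init__(self):
        self.learned_trips = None  # trip count seen on the last pass
        self.streak = 0            # taken streak on the current pass

    def predict(self):
        if self.learned_trips is None:
            return True  # untrained: fall back to predicting taken
        # Predict not-taken only on the learned exit iteration.
        return self.streak + 1 < self.learned_trips

    def update(self, taken):
        if taken:
            self.streak += 1
        else:
            self.learned_trips = self.streak + 1  # remember the trip count
            self.streak = 0

def mispredicts_per_pass(detector, trips):
    wrong = 0
    for taken in [True] * (trips - 1) + [False]:
        if detector.predict() != taken:
            wrong += 1
        detector.update(taken)
    return wrong

d = LoopDetector()
first = mispredicts_per_pass(d, 100)   # exit missed while training
later = mispredicts_per_pass(d, 100)   # trip count learned: no misses
assert (first, later) == (1, 0)
```

Contrast this with the saturating counter above: after a single training pass, a fixed-trip loop is predicted perfectly, exit and all.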
The Indirect Predictor
The second type of specialized branch predictor that the Pentium M uses is
the indirect predictor. As you learned in Chapter 1, branches come in two
flavors: direct and indirect. Direct branches have the branch target explicitly
specified in the instruction, which means that the branch target is fixed at
load time. Indirect branches, on the other hand, have to load the branch
target from a register, so they can have multiple potential targets. Storing
these potential targets is the function of the branch target buffer (BTB)
described in Chapter 5.
Direct branches are the easiest to predict and can often be predicted
with upward of 97 percent accuracy. Indirect branches, in contrast, are
notoriously difficult to predict, and some research puts indirect branch
prediction using the standard BTB method at around 75 percent accuracy.
The Pentium M's indirect predictor works a little like the branch history table that I've described, but instead of storing information about whether or not a particular branch was taken the past few times it was executed, it stores information about each indirect branch's favorite target addresses—the targets to which a particular branch usually likes to jump and the conditions under which it likes to jump to them. So the Pentium M's indirect branch predictor knows that a particular indirect branch in the BHT with a specific set of favorite target addresses stored in the BTB tends to jump to one target address under this set of conditions, while under that set of conditions, it likes to jump to another.
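A rough sketch of the scheme, with a hypothetical structure (the real predictor's organization is not public): index the target table by both the branch's address and some recent history, so the same branch can carry a different predicted target under different conditions.

```python
class IndirectPredictor:
    """Predicts an indirect branch's target from (address, history)."""

    def __init__(self, history_bits=4):
        self.table = {}    # (branch_pc, history) -> last target seen
        self.history = 0
        self.mask = (1 << history_bits) - 1

    def predict(self, branch_pc):
        return self.table.get((branch_pc, self.history))

    def update(self, branch_pc, target):
        self.table[(branch_pc, self.history)] = target
        # Fold the target into the history so future lookups can tell
        # the different "sets of conditions" apart.
        self.history = ((self.history << 1) ^ (target & self.mask)) & self.mask

# A branch that alternates between two targets defeats a last-target
# BTB entry, but the history-indexed table learns the pattern.
p = IndirectPredictor()
hits = 0
for i in range(12):
    target = 0x5 if i % 2 == 0 else 0xA
    if p.predict(branch_pc=0x400) == target:
        hits += 1
    p.update(branch_pc=0x400, target=target)
assert hits == 10  # correct on every access after a two-access warm-up
```

A single-entry BTB would score zero on this alternating pattern, since its one stored target is always stale by the time the branch executes again.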
Intel claims that the combination of the loop detector and indirect
branch predictor gives the Pentium M a 20 percent increase in overall branch
prediction accuracy, resulting in a 7 percent real performance increase.
Improved branch prediction gives the Pentium M a leg up not only in
performance but also in power efficiency. Because of its improved branch
prediction capabilities, the Pentium M wastes fewer cycles and less energy
speculatively executing code that it will then have to throw away once it
learns that it mispredicted a branch.
Intel’s Pentium M, Core Duo, and Core 2 Duo
245
The Stack Execution Unit
Another feature that Intel introduced with the Pentium M is the stack execution unit, a piece of hardware that's designed to reduce the number of in-flight micro-ops that the processor needs to keep track of.
The x86 ISA includes stack-manipulation instructions like pop, push, ret, and call, for use in passing parameters to functions in function calls. During the course of their execution, these instructions update x86's dedicated stack pointer register, ESP. In the Netburst and P6 microarchitectures, this update was carried out by a special micro-op, which was generated by the decoder and charged with using the integer execution units to update ESP by adding to it or subtracting from it as necessary.
The Pentium M’s dedicated stack execution unit eliminates these special
ESP-updating micro-ops by monitoring the decoder’s instruction stream for
incoming stack instructions and keeping track of those instructions’ changes
to ESP. Updates to ESP are handled by a dedicated adder attached to the stack
execution unit instead of by the integer execution units, as in previous designs.
Because the Pentium M’s front end has dedicated hardware for tracking the
state of ESP and keeping it updated, there’s no need to issue those extra ESP-
related micro-ops to the back end.
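The bookkeeping involved can be sketched as follows. This is a hypothetical model with invented names, intended only to show the idea of accumulating a decode-time delta instead of issuing per-instruction updates:

```python
class StackEngine:
    """Decode-time ESP tracking: pushes and pops accumulate a delta in
    the front end instead of each costing an integer-ALU micro-op."""

    def __init__(self):
        self.esp_delta = 0   # pending adjustment (dedicated adder)
        self.alu_uops = 0    # ESP-update micro-ops sent to the back end

    def decode(self, insn, size=4):
        if insn == "push":
            self.esp_delta -= size
        elif insn == "pop":
            self.esp_delta += size
        elif insn == "use_esp":
            # A non-stack use of ESP needs the architectural value, so
            # one micro-op folds the accumulated delta back in.
            if self.esp_delta:
                self.alu_uops += 1
                self.esp_delta = 0

engine = StackEngine()
for insn in ["push", "push", "pop", "pop", "push", "use_esp"]:
    engine.decode(insn)
# Five stack operations cost a single back-end ESP update, where the
# older designs would have spent one integer micro-op per stack
# instruction on updating ESP.
assert engine.alu_uops == 1
```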
This technique has a few benefits. The obvious benefit is that it reduces the number of in-flight micro-ops, which means fewer micro-ops and less power consumed per task. And because there are fewer integer micro-ops in the back end, the integer execution units are free to process other instructions, since they don't have to deal with the stack-related ESP updates.
Pipeline and Back End
The exact length of the Pentium M's pipeline has never been publicly disclosed, but Intel has stated that it is slightly longer than the older P6's 12-stage pipeline. One or two new pipeline stages were added to the Pentium M's front end for timing purposes, with the result that the newer processor can run at a higher clock speed than its predecessor.
Details of the Pentium M’s back end are also scarce, but it is alleged to be
substantially the same as that of the Pentium III. (See Chapter 5 for details of the PIII’s back end.)
Summary: The Pentium M in Historical Context
The Pentium M started out as a processor intended solely for mobile devices,
but it soon became clear to Intel that this much-improved version of the P6
microarchitecture had far more performance-per-watt potential than the