Authors: Jon Stokes
they’re committed together as a single fused micro-op.
The two types of instructions that can be translated into fused micro-ops
are the store instruction type and the load-op instruction type.
Fused Stores
Store instructions on x86 processors, including not only Intel processors but also those from AMD, are broken down during the decode phase into two micro-ops: a store-address micro-op and a store-data micro-op. The store-address micro-op is the instruction that tells the address generation hardware to calculate the address in memory where the data is to be stored. This micro-op is sent to the store-address execution unit in the back end's load-store unit (LSU) for execution. The store-data micro-op is the instruction that writes the data to be stored into the outgoing store-data buffer. From the store-data buffer, the data will be written out to memory when the store instruction commits; this micro-op is executed by the store-data execution unit, which is also located in the LSU.
242
Chapter 12
The Pentium M's instruction decoding hardware decodes the store operation into two separate micro-ops, but it then fuses these two micro-ops together before writing them to a single, shared entry in the micro-op queue.
As noted earlier, the instructions remain fused until they’re issued through
an issue port to the actual store unit, at which point they’re treated separately by the back end. Because the store-address and store-data operations are
inherently parallel and are performed by two separate execution units on
two separate issue ports, these two micro-ops can issue and execute in
parallel—the data can be written to the store buffer at the same time that
the store address is being calculated. When both micro-ops have completed
execution, the core’s commitment unit again treats them as if they are fused.
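The lifecycle just described can be modeled in a few lines of Python. This is only a hypothetical sketch for illustration; the class and method names are invented, not Intel's:

```python
from dataclasses import dataclass

@dataclass
class FusedStore:
    """One shared queue/ROB entry tracking a store's two micro-ops."""
    addr_done: bool = False  # store-address micro-op has executed
    data_done: bool = False  # store-data micro-op has executed

    def issue(self):
        # The two halves leave through separate issue ports, so they
        # may issue and execute in parallel, in either order.
        return ["store-address", "store-data"]

    def complete(self, micro_op):
        if micro_op == "store-address":
            self.addr_done = True
        elif micro_op == "store-data":
            self.data_done = True

    def ready_to_commit(self):
        # The commitment unit treats the pair as fused again: the single
        # entry retires only once both halves have executed.
        return self.addr_done and self.data_done

store = FusedStore()
assert not store.ready_to_commit()
for micro_op in store.issue():   # either order would work here
    store.complete(micro_op)
assert store.ready_to_commit()
```

The point of the sketch is the asymmetry between tracking and execution: one entry in the window, two independent executions, one commit.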
Fused Loads
A load-op, or read-modify, instruction is exactly what it sounds like: a two-part instruction that loads data from memory into a register and then performs an operation on that data. Such instructions are broken down by the decoding hardware into two micro-ops: a load micro-op that's issued to the load execution unit and is responsible for calculating the source address and then loading the needed data, and a second micro-op that performs some type of operation on the loaded data and is executed by the appropriate execution unit.
Load-op instructions are treated in much the same way as store instructions with respect to decoding, fusion, and execution. The load micro-op is fused with the second micro-op, and the two are tracked as a single micro-op in the processor's instruction window. As with the fused stores described earlier, the two constituent parts of the fused micro-op issue and execute separately before being committed together.
Note that unlike the store-address and store-data micro-ops that make up
a fused store, the two parts of a fused load-op instruction are inherently serial, because the load operation must be executed first. Thus the two parts of the
fused load-op must be executed in sequence.
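The serial constraint can be added to the same kind of toy model. Again, this is a hypothetical sketch with invented names, shown only to contrast the ordering rules of the two fusion types:

```python
from dataclasses import dataclass

@dataclass
class FusedLoadOp:
    """One tracking entry whose two halves must execute in order."""
    load_done: bool = False
    op_done: bool = False

    def issuable(self):
        # Serial, unlike a fused store: the ALU half may not issue
        # until the load half has produced its data.
        if not self.load_done:
            return ["load"]
        if not self.op_done:
            return ["op"]
        return []

    def complete(self, micro_op):
        if micro_op == "load":
            self.load_done = True
        elif micro_op == "op":
            assert self.load_done, "op cannot execute before its load"
            self.op_done = True

add_from_memory = FusedLoadOp()
assert add_from_memory.issuable() == ["load"]  # only the load may go first
add_from_memory.complete("load")
assert add_from_memory.issuable() == ["op"]    # now the ALU half is eligible
add_from_memory.complete("op")
assert add_from_memory.issuable() == []        # both halves done; entry can commit
```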
The Impact of Micro-ops Fusion
Intel claims that micro-ops fusion on the Pentium M reduces the number of
micro-ops in the instruction window by over 10 percent. Fewer micro-ops in flight means that a larger number of instructions can be tracked with the same
number of ROB and RS entries. Thus the Pentium M’s re-order, issue, and
commit width is effectively larger than the number of ROB and RS entries
alone would suggest. The end result is that, compared to its predecessors, the Pentium M can extract more performance from the same amount of instruction-tracking hardware, a feature that gives the processor more performance per watt of power dissipated.
As you might expect, the Pentium M sees the most benefit from micro-ops fusion during sustained bursts of memory operations. During long
stretches of memory traffic, all three of the Pentium M’s decoders are able
to work in parallel to process incoming memory instructions, thereby tripling
the decode bandwidth of the older P6 core. Intel estimates that this improved
memory instruction decode bandwidth translates into a 5 percent performance boost for integer code and a 9 percent boost for floating-point code,
the former benefiting more from store fusion and the latter benefiting
equally from store and load-op fusion.
NOTE
You may have noticed that the Pentium M’s micro-ops fusion feature works remarkably
like the PowerPC 970’s instruction grouping scheme, insofar as both processors bind
translated instructions together in the decode phase before dispatching them as a group
to the back end. The analogy between the Pentium M’s micro-ops fusion feature and the
970’s group dispatch feature isn’t perfect, but it is striking, and both features have a
similarly positive effect on performance and power consumption.
Branch Prediction
As you learned in “Caching Basics” on page 215,
the ever-growing distance (in CPU cycles) between main memory and the CPU means that precious
transistor resources spent on branch prediction hardware continue to give
an ever larger return on investment. For reasons of both performance and
power efficiency, Intel spent quite a few transistors on the Pentium M’s
branch predictor.
The Pentium M’s branch predictor is one place where the newer design
borrows from the Pentium 4 instead of the P6. The Pentium M adopts the
Pentium 4’s already powerful branch prediction scheme and expands on it
by adding two new features: a loop detector and an indirect predictor.
The Loop Detector
One of the most common types of branches that a processor encounters is the exit condition of a loop. In fact, loops are so common that the static method of branch prediction, in which all branches are assumed to be loop exit conditions that evaluate to taken, works reasonably well for processors with shallow pipelines.
One problem with static branch predictors is that they always make a wrong prediction on the final iteration of the loop—the iteration on which the branch evaluates to not taken—thereby forcing a pipeline stall as the processor recovers from the erroneous prediction. The other, more important problem with static prediction is that it works poorly for non-loop branches, like standard if-then conditionals. In such branches, a static prediction of taken is roughly the equivalent of a coin toss.
Dynamic predictors, like the Pentium 4’s branch predictor, fix this
shortcoming by keeping track of the execution history of a particular branch
instruction in order to give the processor a better idea of what its outcome
on the current pass will probably be. The bigger the table used to track the
branch’s history, the more data the branch predictor has to work with and
the more accurate its predictions can be. However, even a relatively sizable
branch history table (BHT) like that of the Pentium 4 doesn’t have enough
space to store all the relevant execution history information on the loop
branches, since they tend to take a very large number of iterations. Therefore
loops that go through many iterations will always be mispredicted by a
standard dynamic branch predictor.
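To see why, consider a simplified sketch of an ordinary two-bit saturating-counter predictor watching a single loop branch (a hypothetical model, not the Pentium 4's actual table organization). No matter how regular the loop is, the exit iteration is mispredicted on every pass:

```python
def count_mispredicts(trips, runs):
    """Two-bit saturating counter watching one loop branch.

    The branch is taken (trips - 1) times, then not taken once at the
    loop exit; the whole loop is entered `runs` times.
    """
    state = 3  # counter: 0-1 predict not-taken, 2-3 predict taken
    mispredicts = 0
    for _ in range(runs):
        outcomes = [True] * (trips - 1) + [False]  # False = loop exit
        for taken in outcomes:
            if (state >= 2) != taken:
                mispredicts += 1
            # Saturating update toward the actual outcome.
            state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredicts

# One guaranteed mispredict per pass through the loop: the exit.
assert count_mispredicts(trips=100, runs=10) == 10
```

The counter saturates at "taken" during the loop body, so the exit always surprises it, once per pass, forever.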
The Pentium M's loop detector addresses this problem by analyzing branches as they execute, in order to identify which branches are loop exit conditions. For each branch that the detector thinks is a loop exit condition, the branch prediction hardware initializes a special set of counters in the predictor table to keep track of how many times the loop actually iterates. If loops with even fairly large numbers of iterations always tend to iterate the same number of times, then the Pentium M's branch predictor can predict their behavior with 100 percent accuracy.
In sum, the loop detector plugs into the Pentium M’s P4-style branch
predictor and augments it by providing it with extra, more specialized data
on the loops of the currently executing program.
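A toy model of that idea follows. This is purely illustrative; the real hardware's counters and tables are undocumented, and all names here are invented. The detector records a loop's trip count the first time through and then predicts the exit iteration exactly:

```python
class LoopDetector:
    """Learns a loop branch's trip count, then predicts its exit."""

    def __init__(self):
        self.learned_trips = None  # trip count seen on the last pass
        self.streak = 0            # taken streak on the current pass

    def predict(self):
        if self.learned_trips is None:
            return True  # untrained: fall back to predicting taken
        # Predict not-taken only on the learned exit iteration.
        return self.streak + 1 < self.learned_trips

    def update(self, taken):
        if taken:
            self.streak += 1
        else:
            self.learned_trips = self.streak + 1  # remember the trip count
            self.streak = 0

def mispredicts_per_pass(detector, trips):
    wrong = 0
    for taken in [True] * (trips - 1) + [False]:
        if detector.predict() != taken:
            wrong += 1
        detector.update(taken)
    return wrong

d = LoopDetector()
first = mispredicts_per_pass(d, 100)   # exit missed while training
later = mispredicts_per_pass(d, 100)   # trip count learned: no misses
assert (first, later) == (1, 0)
```

Contrast this with the saturating counter above: after a single training pass, a fixed-trip loop is predicted perfectly, exit and all.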
The Indirect Predictor
The second type of specialized branch predictor that the Pentium M uses is
the indirect predictor. As you learned in Chapter 1, branches come in two
flavors: direct and indirect. Direct branches have the branch target explicitly
specified in the instruction, which means that the branch target is fixed at
load time. Indirect branches, on the other hand, have to load the branch
target from a register, so they can have multiple potential targets. Storing
these potential targets is the function of the branch target buffer (BTB)
described in Chapter 5.
Direct branches are the easiest to predict and can often be predicted
with upward of 97 percent accuracy. Indirect branches, in contrast, are
notoriously difficult to predict, and some research puts indirect branch
prediction using the standard BTB method at around 75 percent accuracy.
The Pentium M's indirect predictor works a little like the branch history table that I've described, but instead of storing information about whether or not a particular branch was taken the past few times it was executed, it stores information about each indirect branch's favorite target addresses—the targets to which a particular branch usually likes to jump and the conditions under which it likes to jump to them. So the Pentium M's indirect branch predictor knows that a particular indirect branch in the BHT with a specific set of favorite target addresses stored in the BTB tends to jump to one target address under this set of conditions, while under that set of conditions, it likes to jump to another.
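A rough sketch of the scheme, with a hypothetical structure (the real predictor's organization is not public): index the target table by both the branch's address and some recent history, so the same branch can carry a different predicted target under different conditions.

```python
class IndirectPredictor:
    """Predicts an indirect branch's target from (address, history)."""

    def __init__(self, history_bits=4):
        self.table = {}    # (branch_pc, history) -> last target seen
        self.history = 0
        self.mask = (1 << history_bits) - 1

    def predict(self, branch_pc):
        return self.table.get((branch_pc, self.history))

    def update(self, branch_pc, target):
        self.table[(branch_pc, self.history)] = target
        # Fold the target into the history so future lookups can tell
        # the different "sets of conditions" apart.
        self.history = ((self.history << 1) ^ (target & self.mask)) & self.mask

# A branch that alternates between two targets defeats a last-target
# BTB entry, but the history-indexed table learns the pattern.
p = IndirectPredictor()
hits = 0
for i in range(12):
    target = 0x5 if i % 2 == 0 else 0xA
    if p.predict(branch_pc=0x400) == target:
        hits += 1
    p.update(branch_pc=0x400, target=target)
assert hits == 10  # correct on every access after a two-access warm-up
```

A single-entry BTB would score zero on this alternating pattern, since its one stored target is always stale by the time the branch executes again.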
Intel claims that the combination of the loop detector and indirect
branch predictor gives the Pentium M a 20 percent increase in overall branch
prediction accuracy, resulting in a 7 percent real performance increase.
Improved branch prediction gives the Pentium M a leg up not only in
performance but also in power efficiency. Because of its improved branch
prediction capabilities, the Pentium M wastes fewer cycles and less energy
speculatively executing code that it will then have to throw away once it
learns that it mispredicted a branch.
Intel’s Pentium M, Core Duo, and Core 2 Duo
245
The Stack Execution Unit
Another feature that Intel introduced with the Pentium M is the stack execution unit, a piece of hardware that's designed to reduce the number of in-flight micro-ops that the processor needs to keep track of.
The x86 ISA includes stack-manipulation instructions like pop, push, ret, and call, for use in passing parameters to functions in function calls. During the course of their execution, these instructions update x86's dedicated stack pointer register, ESP. In the Netburst and P6 microarchitectures, this update was carried out by a special micro-op, which was generated by the decoder and charged with using the integer execution units to update ESP by adding to it or subtracting from it as necessary.
The Pentium M’s dedicated stack execution unit eliminates these special
ESP-updating micro-ops by monitoring the decoder’s instruction stream for
incoming stack instructions and keeping track of those instructions’ changes
to ESP. Updates to ESP are handled by a dedicated adder attached to the stack
execution unit instead of by the integer execution units, as in previous designs.
Because the Pentium M’s front end has dedicated hardware for tracking the
state of ESP and keeping it updated, there’s no need to issue those extra ESP-
related micro-ops to the back end.
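The bookkeeping involved can be sketched as follows. This is a hypothetical model with invented names, intended only to show the idea of accumulating a decode-time delta instead of issuing per-instruction updates:

```python
class StackEngine:
    """Decode-time ESP tracking: pushes and pops accumulate a delta in
    the front end instead of each costing an integer-ALU micro-op."""

    def __init__(self):
        self.esp_delta = 0   # pending adjustment (dedicated adder)
        self.alu_uops = 0    # ESP-update micro-ops sent to the back end

    def decode(self, insn, size=4):
        if insn == "push":
            self.esp_delta -= size
        elif insn == "pop":
            self.esp_delta += size
        elif insn == "use_esp":
            # A non-stack use of ESP needs the architectural value, so
            # one micro-op folds the accumulated delta back in.
            if self.esp_delta:
                self.alu_uops += 1
                self.esp_delta = 0

engine = StackEngine()
for insn in ["push", "push", "pop", "pop", "push", "use_esp"]:
    engine.decode(insn)
# Five stack operations cost a single back-end ESP update, where the
# older designs would have spent one integer micro-op per stack
# instruction on updating ESP.
assert engine.alu_uops == 1
```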
This technique has a few benefits. The obvious benefit is that it reduces the number of in-flight micro-ops, which means fewer micro-ops and less power consumed per task. And because there are fewer integer micro-ops in the back end, the integer execution units are free to process other instructions, since they don't have to deal with the stack-related ESP updates.
Pipeline and Back End
The exact length of the Pentium M's pipeline has never been publicly disclosed, but Intel has stated that it is slightly longer than the older P6's 12-stage pipeline. One or two new pipeline stages were added to the Pentium M's front end for timing purposes, with the result that the newer processor can run at a higher clock speed than its predecessor.
Details of the Pentium M’s back end are also scarce, but it is alleged to be
substantially the same as that of the Pentium III. (See Chapter 5 for details of the PIII’s back end.)
Summary: The Pentium M in Historical Context
The Pentium M started out as a processor intended solely for mobile devices,
but it soon became clear to Intel that this much-improved version of the P6
microarchitecture had far more performance-per-watt potential than the