Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (66 page)

Authors: Jon Stokes


is permanently altered, that instruction is said to commit. Instructions must commit in program order if the illusion of sequential execution is to be maintained. This means that no instruction can commit until all of the instructions

that were originally ahead of it in the code stream have committed.

The requirement that all instructions must commit in their original

program order is what necessitates the second buffer shown in Figure 5-8.

The processor needs a place to collect instructions as they complete the

out-of-order execution phase of their lifecycle, so that they can be put back in

their original order before being sent to the final write stage, where they’re

committed. Like the issue buffer described earlier, this completion buffer

can take a number of forms. We’ll look at the form that this buffer takes in

the P6 shortly.

I stated previously that an instruction sits in the completion phase’s

buffer, which I’ll call the completion buffer for now, and waits to have its result written back to the register file. But where does the instruction’s result wait

during this interim period? When an instruction is executed out of order, its

result goes into a special rename register that has been allocated especially

for use by that instruction. Note that this rename register is part of the

processor’s internal bookkeeping apparatus, which means it is not a part of

the programming model and is therefore not visible to the programmer. The

result waits in this hidden rename register until the instruction commits, at

which time the result is written from the rename register into the

programmer-visible architectural register file. After the instruction’s result is committed, the rename register then goes back into the pool of available rename

registers, where it can be assigned to another instruction on a later cycle.
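
The rename register lifecycle just described (allocate, hold the result, commit, return to the pool) can be sketched in a few lines of Python. This is a toy model under my own assumptions, not a description of any real processor's circuitry; the class `RenamePool` and its method names are hypothetical.

```python
class RenamePool:
    """Toy pool of hidden rename registers (all names here are hypothetical)."""

    def __init__(self, size):
        self.free = list(range(size))   # indices of unallocated rename registers
        self.values = {}                # rename index -> result waiting to commit
        self.dest = {}                  # rename index -> architectural register name
        self.arch = {}                  # programmer-visible register file

    def allocate(self, arch_reg):
        """Reserve a rename register for an instruction that will write arch_reg."""
        if not self.free:
            raise RuntimeError("no free rename registers; the instruction must wait")
        idx = self.free.pop(0)
        self.dest[idx] = arch_reg
        return idx

    def write_result(self, idx, value):
        """Execution finished: park the result in the hidden register."""
        self.values[idx] = value

    def commit(self, idx):
        """Copy the result to the architectural file and recycle the register."""
        self.arch[self.dest.pop(idx)] = self.values.pop(idx)
        self.free.append(idx)


pool = RenamePool(size=2)
r = pool.allocate("eax")
pool.write_result(r, 42)
visible_before_commit = "eax" in pool.arch   # the programmer cannot see it yet
pool.commit(r)
```

Until `commit` runs, the value lives only in the hidden register; afterward it appears in `pool.arch` and the rename register is available again.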

The P6’s Issue Phase: The Reservation Station

The P6 microarchitecture feeds each newly decoded instruction into a buffer

called the reservation station (RS), where it waits until all of its execution requirements are met. Once they’ve been met, the instruction then moves out of the

reservation station into an execution unit (i.e., it is issued), where it executes.

Chapter 5

A glance at the P6 diagram (Figure 5-6) shows that up to three

instructions per cycle can be dispatched from the decoders into the reservation

station. And as you’ll see shortly, up to five instructions per cycle can be

issued from the reservation station into the execution units. Thus the

Pentium’s original superscalar design, in which two instructions per cycle

could dispatch from the decoders directly into the back end, has been

replaced with a buffered design in which three instructions can dispatch

into the buffer and five instructions can issue out of it on any given cycle.

This buffering action, and the decoupling of the front end’s fetch/

decode bandwidth from the back end’s execution bandwidth that it enables,

are at the heart of the P6’s performance gains.
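
To see why a buffer between the front end and the back end helps, here is a rough queue simulation. It is only a sketch: the function `simulate`, the per-cycle `supply` and `demand` lists, and the two buffer depths are made-up illustrative inputs, and real hardware tracks far more state than a single occupancy counter.

```python
def simulate(supply, demand, capacity, issue_width=5):
    """Return how many instructions issue over the whole run.

    supply[c]: instructions the front end can deliver on cycle c (at most 3 here).
    demand[c]: instructions the back end is able to start on cycle c.
    capacity:  number of slots in the buffer that sits between the two.
    """
    occupancy = 0
    issued_total = 0
    for offered, wanted in zip(supply, demand):
        # Front end: whatever does not fit is a wasted fetch/decode slot.
        occupancy += min(offered, capacity - occupancy)
        # Back end: drain up to issue_width instructions that it can start.
        issued = min(wanted, occupancy, issue_width)
        occupancy -= issued
        issued_total += issued
    return issued_total


# Made-up workload: the back end is slow for four cycles,
# then the front end hiccups for four cycles.
supply = [3, 3, 3, 3, 0, 0, 0, 0]
demand = [1, 1, 1, 1, 5, 5, 5, 5]

shallow = simulate(supply, demand, capacity=3)   # little more than a latch
deep = simulate(supply, demand, capacity=20)     # a deeper buffer
print(shallow, deep)
```

With the shallow buffer, the front end's early work has nowhere to go, so the back end starves once the supply dries up. With the deep buffer, instructions pile up while the back end is slow and are drained during the front-end hiccup.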

The P6’s Completion Phase: The Reorder Buffer

Because the P6 microarchitecture must commit its instructions in order, it

needs a place to keep track of the original program order of each instruction

that enters the reservation station. Therefore, after the instructions are

decoded, they must travel through the reorder buffer (ROB) before flowing into the reservation station. The ROB is like a large logbook in which the P6 can

record all the essential information about each instruction that enters the

out-of-order back end. The primary function of the ROB is to ensure that

instructions come out the other side of the out-of-order back end in the same

order in which they entered it. In other words, it’s the reservation station’s

job to see that instructions are executed in the optimal order, even if

that means executing them out of program order. It’s the reorder buffer’s

job to ensure that the finished instructions get put back in program order

and that their results are written to the architectural register file in the

proper sequence. To this end, the ROB stores data about each instruction’s

status, operands, register needs, original place in the program, and so on.

So newly decoded instructions flow into the ROB, where their relevant

information is logged in one of 40 available entries. From there, they pass on

to the reservation station, and then on to the back end. Once they’re done

executing, they wait in the ROB until they’re ready to be committed.
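
The logbook behavior can be sketched as a queue that only ever retires from its oldest end. This is a minimal illustration, assuming a simple "done" flag per entry; the class `ReorderBuffer` and its methods are hypothetical names, and only the 40-entry capacity comes from the text.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are logged in program order and retired in that order."""

    def __init__(self, capacity=40):
        self.capacity = capacity
        self.entries = deque()          # oldest instruction sits at the left

    def log(self, name):
        """Record a newly decoded instruction; it has not executed yet."""
        if len(self.entries) >= self.capacity:
            raise RuntimeError("ROB full; the front end must stall")
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def commit_ready(self):
        """Retire finished instructions, but only from the head of the queue."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["name"])
        return committed


rob = ReorderBuffer()
a, b, c = rob.log("A"), rob.log("B"), rob.log("C")

c["done"] = True                 # C finishes first, out of program order
first = rob.commit_ready()       # nothing retires: A is still ahead of C
a["done"] = True
second = rob.commit_ready()      # A retires; B still blocks C
b["done"] = True
third = rob.commit_ready()       # B and C retire together, in program order
```

Even though C finished executing first, it leaves the buffer last, which is exactly the ordering guarantee the ROB exists to provide.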

The role I’ve just described for the reorder buffer should be familiar to

you at this point. The reorder buffer corresponds to the structure that I

called the completion buffer earlier, but with a few extra duties assigned to it.

If you look at my diagram of the P6 microarchitecture, you’ll notice that

the reorder buffer is depicted in two spots: the front end and the commit unit.

This is because the ROB is active in both of these phases of the instruction’s

lifecycle. The ROB is tasked with tracking instructions as they move through

the phases of their lifecycle and with putting the instructions back in program

order at the end of their lifecycle. So newly decoded instructions must be

given a tracking entry in the ROB and have a temporary rename register

allocated for their private use. Similarly, newly executed instructions must

wait in the ROB before they can commit by having the contents of the

temporary rename register that holds their result permanently written to

the architectural register file.

The Intel Pentium and Pentium Pro

As implied in the previous sentence, not only does the P6’s ROB act as a

completion buffer and an instruction tracker, but it also handles register

renaming. Each of the P6 microarchitecture’s 40 ROB entries has a data field that holds program data just like an x86 register. These fields give the P6’s back end 40 microarchitectural rename registers to work with, and they’re

used in combination with the P6’s register allocation table (RAT) to implement register renaming in the P6 microarchitecture.
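
A rough sketch of how a table like the RAT can steer operands toward ROB data fields follows. It is an illustration under simplifying assumptions, not the P6's actual logic: instructions are written as hypothetical `(dest, src1, src2)` triples, the function `rename` and the `robN` tags are invented names, and the sketch ignores the 40-entry limit and entry recycling.

```python
def rename(instructions):
    """Rename (dest, src1, src2) triples so each result gets its own ROB slot."""
    rat = {}        # architectural register -> ROB entry holding its newest value
    renamed = []
    for rob_index, (dest, src1, src2) in enumerate(instructions):
        # A source reads the ROB entry that most recently wrote the register,
        # or the architectural register itself if no in-flight writer exists.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        tag = f"rob{rob_index}"
        rat[dest] = tag             # later readers of dest are pointed here
        renamed.append((tag, s1, s2))
    return renamed


program = [
    ("eax", "ebx", "ecx"),   # writes eax
    ("edx", "eax", "ebx"),   # reads eax: a true dependence on the first line
    ("eax", "esi", "edi"),   # writes eax again: only a name clash
]
renamed = rename(program)
```

After renaming, the second instruction reads `rob0` while the third writes `rob2`, so the third no longer has to wait for the second to finish reading the old `eax`.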

The Instruction Window

The reservation station and the reorder buffer together make up the heart

of the P6’s out-of-order back end, and they account for its drastic

clock-for-clock performance advantage over the original Pentium. These two buffers,

the one for reshuffling and optimizing the code stream (the RS) and the

other for unshuffling and reordering the code stream (the ROB), enable

the P6 processor to dynamically and intelligently adapt its operation to fit the needs of the ever-changing code stream.

A common metaphor for thinking and talking about the P6’s RS +

ROB combination, or analogous structures on other processors, is that of an

instruction window. The P6’s ROB can track up to 40 instructions in various

stages of execution, and its reservation station can hold and examine up to

20 instructions to determine the optimal time for them to execute. Think of

the reservation station’s 20-instruction buffer as a window that moves along the sequentially ordered code stream; on any given cycle, the P6 is looking through

this window at that visible segment of the code stream and thinking about

how its hardware can optimally execute the 20 or so instructions that it sees

there.

A good analogy for this is the game of Tetris, where a small preview

window shows you the next piece that will come your way while you’re

deciding how best to place the currently falling piece. Thus at any given

moment, you can see a total of two Tetris pieces and think about how those

two should fit with the pieces that have gone before and those that might

come after.

The P6 microarchitecture’s job is a little harder than the average Tetris

player’s, because it must maneuver and optimally place as many as three

falling pieces at a time; hence it needs to be able to see farther ahead into

the future in order to make the best decisions about what to place where and

when. The P6’s wider instruction window allows the processor to look further

ahead in the code stream and to juggle its instructions so that they fit together with the currently available execution resources in the optimal manner.
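
The effect of window size can be shown with a small sketch. It assumes a hypothetical representation in which each pending instruction is a `(name, needs)` pair, where `needs` is the set of earlier instructions whose results it requires; the function `pick_issuable` is an invented name, and the default sizes (a 20-entry window, up to five issued per cycle) are the figures quoted in this chapter.

```python
def pick_issuable(pending, completed, window_size=20, issue_width=5):
    """Return the names of instructions that could issue this cycle.

    Only the first window_size pending instructions are visible, and an
    instruction is ready when everything it needs has already completed.
    """
    window = pending[:window_size]
    ready = [name for name, needs in window if needs <= completed]
    return ready[:issue_width]      # oldest ready instructions go first


pending = [
    ("load1", set()),
    ("add1", {"load1"}),    # must wait for load1
    ("load2", set()),
    ("mul1", {"load2"}),    # must wait for load2
    ("sub1", set()),
]

wide = pick_issuable(pending, completed=set())
narrow = pick_issuable(pending, completed=set(), window_size=2)
```

With the full window, three independent instructions are found in one look; with a two-entry window, only `load1` is visible and ready, so the other execution units would sit idle.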

The P6 Pipeline

The P6 has a 12-stage pipeline that’s considerably longer than the Pentium’s

five-stage pipeline. I won’t enumerate and describe all 12 stages individually,

but I will give a general overview of the phases that the P6’s pipeline passes

through.

BTB access and instruction fetch

The first three-and-a-half pipeline stages are dedicated to accessing the

branch target buffer and fetching the next instruction. The P6’s

two-cycle instruction fetch phase is longer than the Pentium’s one-cycle fetch

phase, but it keeps the L1 cache access latency from holding back the

clock speed of the processor as a whole.

Decode

The next two-and-a-half stages are dedicated to decoding x86 instructions and breaking them down into the P6’s internal, RISC-like instruction

format. We’ll discuss this instruction set translation, which takes place

in all modern x86 processors and even in some RISC processors, in

more detail shortly.

Register rename

This stage takes care of register renaming and logging instructions in

the ROB.

Write to RS

Writing instructions from the ROB into the RS takes one cycle, and it

occurs in this stage.

Read from RS

At this point, the issue phase of the instruction’s lifecycle is under way.

Instructions can sit in the RS for an unspecified number of cycles before

being read from the RS. Even if they’re read from the RS immediately

after entering it, it takes one cycle to move instructions out of the RS,

through the issue ports and into the execution units.

Execute

Instruction execution can take one cycle, as in the case of simple

integer instructions, or multiple cycles, as in the case of floating-point

instructions.

Commit

These two final cycles are dedicated to writing the results of the

instruction execution back into the ROB, and then committing the instructions

by writing their results from the ROB into the architectural register file.
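
As a quick check, the stage counts in the list above can be added up. The only assumption in this sketch is that the execute phase is counted as a single stage, as it is for simple integer instructions; the variable names are mine.

```python
from fractions import Fraction

# (phase, number of pipeline stages), taken from the list above.
phases = [
    ("BTB access and instruction fetch", Fraction(7, 2)),   # three and a half
    ("Decode", Fraction(5, 2)),                             # two and a half
    ("Register rename", 1),
    ("Write to RS", 1),
    ("Read from RS", 1),
    ("Execute", 1),        # assumption: the one-cycle integer case
    ("Commit", 2),
]

total_stages = sum(count for _, count in phases)
stages_before_execute = sum(count for _, count in phases[:5])
print(total_stages, stages_before_execute)
```

The phases sum to the 12 stages quoted for the P6, and nine of them come before the execute stage.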

Lengthening the P6’s pipeline as described in this chapter has two primary

beneficial effects. First, it allows Intel to crank up the processor’s clock speed, since each of the stages is shorter and simpler and can be completed more quickly.

The second effect is a little more subtle and less widely appreciated.

The P6’s longer pipeline, when combined with its buffered decoupling

of fetch/decode bandwidth from execution bandwidth, allows the processor

to hide hiccups in the fetch and decode stages. In short, the nine pipeline

stages that lie ahead of the execute stage combine with the RS to form a deep

buffer for instructions. This buffer can hide gaps and hang-ups in the flow

of instructions in much the same way that a large water reservoir can hide

interruptions in the flow of water to a facility.

But on the downside (to continue the water reservoir example), when

one dead animal is spotted floating in the reservoir, the whole thing has to

be flushed. This is sort of the case with the P6 and a branch misprediction.

Branch Prediction on the P6

The P6’s architects devoted considerably more resources to branch prediction

than the Pentium’s designers did, and they managed to boost dynamic branch

prediction accuracy from the Pentium’s approximately 75 percent rate to upwards

of 90 percent. The P6 has a 512-entry BHT + BTB, and it uses four bits to

record branch history information (compared to the Pentium’s two-bit

predictor). The four-bit prediction scheme allows the P6 to store more

of each branch’s history, thereby increasing its ability to correctly predict

branch outcomes.
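
The text doesn’t spell out how the four history bits are used, so the following is only a sketch of one textbook arrangement: a two-level scheme in which the last four outcomes of a branch select one of 16 two-bit saturating counters. It is compared against a single two-bit counter. The class names, the initial counter values, and the loop-style branch pattern are all my own assumptions.

```python
class TwoBitCounter:
    """One saturating counter: values 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 1

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)


class FourBitHistory:
    """The last four outcomes select one of 16 two-bit counters."""

    def __init__(self):
        self.history = 0             # four most recent outcomes, newest in bit 0
        self.counters = [1] * 16     # one saturating counter per history pattern

    def predict(self):
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & 0b1111


def mispredictions(predictor, outcomes, warmup):
    """Count wrong guesses after the first `warmup` branches."""
    wrong = 0
    for i, taken in enumerate(outcomes):
        if i >= warmup and predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop-style branch: taken three times, then not taken, over and over.
pattern = [True, True, True, False] * 12

simple_misses = mispredictions(TwoBitCounter(), pattern, warmup=8)
history_misses = mispredictions(FourBitHistory(), pattern, warmup=8)
```

Once warmed up, the single counter still guesses wrong every time the loop exits, while the history-indexed table has seen what follows three taken branches and predicts the exit correctly.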

As you learned in Chapter 2, branch prediction gets more important as

pipelines get longer, because a pipeline flush due to a misprediction means

more lost cycles and a longer recovery time for the processor’s instruction

throughput and completion rate.

Consider the case of a conditional branch whose outcome depends on

the result of an integer calculation. On the original Pentium, the calculation

happens in the fourth pipeline stage, and if the branch prediction unit

(BPU) has guessed incorrectly, only three cycles’ worth of work would be lost

in the pipeline flush. On the P6, though, the conditional calculation isn’t

performed until stage 10, which means 10 cycles’ worth of work gets flushed if

the BPU guesses incorrectly.
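
A back-of-the-envelope calculation shows how prediction accuracy and flush cost trade off. The sketch below simply multiplies the misprediction rate by the flush penalty, using the accuracy and penalty figures quoted in this section; treating every misprediction as costing the full penalty is my simplifying assumption, and the function name is hypothetical.

```python
def expected_cycles_lost(accuracy_percent, flush_penalty_cycles):
    """Average cycles lost to mispredictions per branch executed."""
    return (100 - accuracy_percent) * flush_penalty_cycles / 100


pentium = expected_cycles_lost(75, 3)                     # short pipeline
p6 = expected_cycles_lost(90, 10)                         # long pipeline, better predictor
p6_with_pentium_accuracy = expected_cycles_lost(75, 10)   # long pipeline, old predictor
print(pentium, p6, p6_with_pentium_accuracy)
```

Under these assumptions, the longer pipeline paired with the Pentium's accuracy would lose 2.5 cycles per branch on average, versus 0.75 on the Pentium itself; raising accuracy to 90 percent brings the P6 back to about 1 cycle per branch.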

When a dynamically scheduled processor executes instructions

speculatively, those speculative instructions and their results are stored in the ROB

just like non-speculative instructions. However, the ROB entries for the

speculative instructions are marked as speculative and prevented from committing
