Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (75 page)

Authors: Jon Stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors


Figure 7-4: The basic microarchitecture of the G4e

[Diagram not reproduced. Labeled blocks: Decode/Dispatch Unit; Vector, FP, and General Issue Queues; reservation stations (RS); Vector Arithmetic Logic Units (VPU, VSIU, VCIU, VFPU); Scalar Arithmetic Logic Units (FPU; integer units IU1a, IU2); Memory Access Units (Load-Store Unit, LSU); Completion Queue; Commit Unit.]

Chapter 7

Before instructions can enter the G4e’s pipeline, they have to be available in its 32KB instruction cache. This instruction cache, together with the 32KB data cache, makes up the G4e’s 64KB L1 cache. An instruction leaves the L1 and goes down through the various front-end stages until it hits the back end, at which point it’s executed by one of the G4e’s eight execution units (not counting the branch execution unit, which we’ll talk about in a second).

As I’ve already noted, the G4e breaks down the G4’s classic, four-stage pipeline into seven, shorter stages:

G4                      G4e
1  Fetch                1  Fetch-1
                        2  Fetch-2
2  Decode/dispatch      3  Decode/dispatch
                        4  Issue
3  Execute              5  Execute
                        6  Complete
4  Write-back           7  Write-back (Commit)

Notice that the G4e dedicates one pipeline stage each to the characteristic issue and complete phases that bracket the out-of-order execution phase of a dynamically scheduled instruction’s lifecycle.

Let’s take a quick look at the basic pipeline stages of the G4e, because this will highlight some of the ways in which the G4e differs from the original G4. Also, an understanding of the G4e’s more classic RISC pipeline will provide you with a good foundation for our upcoming discussion of the Pentium 4’s much longer, more peculiar pipeline.

Stages 1 and 2: Instruction Fetch

These two stages are both dedicated primarily to grabbing an instruction from the L1 cache. Like its predecessor, the G4, the G4e can fetch up to four instructions per clock cycle from the L1 cache and send them on to the next stage. Hopefully, the needed instructions are in the L1 cache. If they aren’t in the L1 cache, the G4e has to hit the much slower L2 cache to find them, which can add up to nine cycles of delay into the instruction pipeline.

Stage 3: Decode/Dispatch

Once an instruction has been fetched, it goes into the G4e’s 12-entry instruction queue to be decoded. Once instructions are decoded, they’re dispatched at a rate of up to three non-branch instructions per cycle to the proper issue queue.

Note that the G4e’s dispatch logic dispatches instructions to the issue queues in accordance with “The Four Rules of Instruction Dispatch” on page 127. The only modification to the rules is in the issue buffer rule; instead of requiring that the proper execution unit and reservation station be available before an instruction can be dispatched, the G4e requires that there be space in one of the three issue queues.
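The modified issue buffer rule can be sketched as a small simulation. This is a minimal sketch, not the G4e's actual dispatch logic: the queue capacities and per-cycle limits are the figures quoted in this chapter, while the instruction-class names, the class-to-queue mapping, and the choice to stop at the first instruction that cannot dispatch (in-order dispatch) are assumptions made for illustration.

```python
from collections import deque

# Capacities and per-cycle accept limits as quoted in the text.
ISSUE_QUEUES = {
    "GIQ": {"capacity": 6, "accept_per_cycle": 3},
    "VIQ": {"capacity": 4, "accept_per_cycle": 2},
    "FIQ": {"capacity": 1, "accept_per_cycle": 1},
}
MAX_DISPATCH_PER_CYCLE = 3  # non-branch instructions per cycle

# Hypothetical mapping from instruction class to issue queue.
QUEUE_FOR_CLASS = {"int": "GIQ", "mem": "GIQ", "vec": "VIQ", "fp": "FIQ"}


def dispatch_cycle(instruction_queue, issue_queues):
    """Dispatch for one cycle; return the names of dispatched instructions.

    The only resource checked is space in the target issue queue.
    Execution-unit and reservation-station availability are ignored,
    which is the point of the modified issue buffer rule.
    """
    accepted = {name: 0 for name in ISSUE_QUEUES}
    dispatched = []
    while instruction_queue and len(dispatched) < MAX_DISPATCH_PER_CYCLE:
        name, kind = instruction_queue[0]
        target = QUEUE_FOR_CLASS[kind]
        limits = ISSUE_QUEUES[target]
        has_space = len(issue_queues[target]) < limits["capacity"]
        under_rate = accepted[target] < limits["accept_per_cycle"]
        if not (has_space and under_rate):
            break  # assumed in-order dispatch: younger instructions wait
        issue_queues[target].append(instruction_queue.popleft())
        accepted[target] += 1
        dispatched.append(name)
    return dispatched


instruction_queue = deque(
    [("add", "int"), ("fadd", "fp"), ("fmul", "fp"), ("lwz", "mem")]
)
issue_queues = {name: deque() for name in ISSUE_QUEUES}
first_cycle = dispatch_cycle(instruction_queue, issue_queues)
```

In this run the second floating-point instruction stalls dispatch because the single-entry FIQ is already occupied, even though nothing was said about whether the FPU itself is busy.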

Stage 4: Issue

The issue stage is the place where the G4e differs the most from the G4. Specifically, the presence of the G4e’s three issue queues endows it with power and flexibility that the G4 lacks.

As you learned in Chapter 6, instructions can stall in the original G4’s dispatch stage if there is no execution unit available to take them. The G4e eliminates this potential dispatch stall condition by placing a set of buffers, called issue queues, in between the dispatch stage and the reservation stations. On the G4e, it doesn’t matter if the execution units are busy and their reservation stations are full; an instruction can still dispatch to the back end if there is space in the proper issue queue.

The six-entry general issue queue (GIQ) feeds the integer ALUs and can accept up to three instructions per cycle from the dispatch unit. It can also issue up to three instructions per cycle out of order from its bottommost three entries to any of the G4e’s three integer units or to its LSU.

The four-entry vector issue queue (VIQ) can accept up to two instructions per cycle from the dispatch unit, and it can issue up to two instructions per cycle from its bottommost two entries to any two of the four vector execution units. But note that unlike the GIQ, instructions must issue in order from the bottom of the VIQ.

Finally, the single-entry floating-point issue queue (FIQ) can accept one instruction per cycle from the dispatch unit, and it can issue one instruction per cycle to the FPU.

With the help of the issue queues, the G4e’s dispatcher can keep dispatching instructions and clearing the instruction queue, even if the execution units and their attached reservation stations are full. Furthermore, the GIQ’s out-of-order issue ability allows integer and memory instructions in the code stream to flow around instructions that are stalled in the execute phase, so that a stalled instruction doesn’t back up the pipeline and cause pipeline bubbles. For example, if a multicycle integer instruction is stalled in the bottom GIQ entry because the complex integer unit is busy, single-cycle integer instructions and load/store instructions can continue to issue to the simple integer units and the LSU from the two slots behind the stalled instruction.
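The difference between the GIQ's out-of-order issue and the VIQ's in-order issue can be sketched with one function. This is an illustrative model only: the unit-kind labels and the `free_units` bookkeeping (a stand-in for reservation station availability) are hypothetical, while the window sizes and issue widths follow the figures given above.

```python
def issue_from_queue(queue, free_units, window, max_issue, in_order=False):
    """Issue instructions from the bottom `window` entries of an issue queue.

    queue      -- list of (name, unit_kind); index 0 is the bottom (oldest) entry
    free_units -- dict mapping unit_kind to the number of free units this cycle
    in_order   -- if True, a blocked entry also blocks everything behind it
    Returns the names issued this cycle; `queue` and `free_units` are updated.
    """
    issued, remaining = [], []
    blocked = False
    for pos, (name, unit) in enumerate(queue):
        can_issue = (
            pos < window
            and not blocked
            and len(issued) < max_issue
            and free_units.get(unit, 0) > 0
        )
        if can_issue:
            free_units[unit] -= 1
            issued.append(name)
        else:
            remaining.append((name, unit))
            if in_order:
                blocked = True
    queue[:] = remaining
    return issued


# The text's example: a multicycle integer instruction is stuck at the
# bottom of the GIQ because the complex integer unit is busy.
giq = [
    ("mullw", "complex_int"),
    ("add", "simple_int"),
    ("lwz", "lsu"),
    ("sub", "simple_int"),
]
giq_issued = issue_from_queue(
    giq, {"complex_int": 0, "simple_int": 3, "lsu": 1}, window=3, max_issue=3
)

# The same kind of blockage in the VIQ holds up the entry behind it.
viq = [("vperm", "vpu"), ("vaddfp", "vfpu")]
viq_issued = issue_from_queue(
    viq, {"vpu": 0, "vfpu": 1}, window=2, max_issue=2, in_order=True
)
```

Here the add and the load flow around the stalled multiply, the fourth GIQ entry waits because it sits outside the three-entry issue window, and the VIQ issues nothing because its bottom entry is blocked.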

Stage 5: Execute

The execute stage is pretty straightforward. Here, the instructions pass from the reservation stations into the execution units to be executed. Floating-point instructions move into the floating-point execution unit, vector instructions move into one of the four AltiVec units, integer instructions move into one of the G4e’s four integer execution units, and memory accesses move into the LSU. We’ll talk about these units in a bit more detail when we discuss the G4e’s back end.


Stages 6 and 7: Complete and Write-Back

In these two stages, the instructions enter the completion queue to be put back into program order, and their results are written back to the register file. It’s important that the instructions are rearranged to reflect their original ordering so that the illusion of in-order execution is maintained. The user needs to think that the program’s commands were executed one after the other, the way they were written.
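The reordering performed by the completion queue can be sketched as a FIFO that only releases its oldest entry once that entry has finished executing. This is a minimal model under stated assumptions: the class and method names are hypothetical, and the default limit of three retirements per cycle is an illustrative parameter that the text does not give.

```python
from collections import deque


class CompletionQueue:
    """Toy completion queue: entries are allocated in program order and
    retired in program order, regardless of when they finish executing."""

    def __init__(self):
        self._entries = deque()   # instruction tags, oldest first
        self._finished = set()    # tags whose execution has finished

    def allocate(self, tag):
        """Record an instruction in program order."""
        self._entries.append(tag)

    def mark_finished(self, tag):
        """Called when an execution unit finishes, possibly out of order."""
        self._finished.add(tag)

    def retire(self, max_per_cycle=3):
        """Retire finished instructions from the head of the queue only."""
        retired = []
        while (
            self._entries
            and len(retired) < max_per_cycle
            and self._entries[0] in self._finished
        ):
            tag = self._entries.popleft()
            self._finished.discard(tag)
            retired.append(tag)
        return retired


cq = CompletionQueue()
for tag in ("i0", "i1", "i2"):
    cq.allocate(tag)

cq.mark_finished("i2")          # youngest instruction finishes first
cq.mark_finished("i1")
early = cq.retire()             # nothing retires: i0 is still executing
cq.mark_finished("i0")
late = cq.retire()              # now all three retire, in program order
```

Results become architecturally visible only at retirement, which is what preserves the illusion of in-order execution.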

Branch Prediction on the G4e and Pentium 4

The G4e and the Pentium 4 each use both static and dynamic branch prediction techniques to prevent mispredictions and branch delays. If a branch instruction does not have an entry in the BHT, both processors will use static prediction to decide which path to take. If the instruction does have a BHT entry, dynamic prediction is used. The Pentium 4’s BHT is quite large; at 4,000 entries, it has enough space to store information on most of the branches in an average program.
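The fall-back from dynamic to static prediction can be sketched as follows. Several details here are assumptions rather than anything the text specifies: the two-bit saturating counter, the "backward branches are taken" static rule, the simple address-modulo indexing, and the table size of 4,096 (chosen only as the power of two nearest the quoted 4,000 entries). Neither processor's actual prediction algorithm is described in this chapter.

```python
class BranchPredictor:
    """Toy predictor: dynamic prediction on a BHT hit, static on a miss."""

    def __init__(self, bht_entries=4096):
        self.bht_entries = bht_entries
        self.bht = {}  # table index -> 2-bit counter in the range 0..3

    def _index(self, branch_addr):
        # Assumed indexing: drop the low two address bits, wrap to table size.
        return (branch_addr >> 2) % self.bht_entries

    def predict(self, branch_addr, target_addr):
        """Return True if the branch is predicted taken."""
        counter = self.bht.get(self._index(branch_addr))
        if counter is None:
            # No BHT entry: static prediction. Assumed rule: backward
            # branches (typically loops) are taken, forward ones are not.
            return target_addr < branch_addr
        # BHT entry present: dynamic prediction from the counter's upper half.
        return counter >= 2

    def update(self, branch_addr, taken):
        """Record the actual outcome once the branch resolves."""
        idx = self._index(branch_addr)
        if idx not in self.bht:
            self.bht[idx] = 2 if taken else 1  # start weakly biased
        elif taken:
            self.bht[idx] = min(3, self.bht[idx] + 1)
        else:
            self.bht[idx] = max(0, self.bht[idx] - 1)


predictor = BranchPredictor()
static_backward = predictor.predict(0x1000, 0x0F00)  # no entry, backward
static_forward = predictor.predict(0x1000, 0x1100)   # no entry, forward
predictor.update(0x1000, taken=False)
dynamic_after_miss = predictor.predict(0x1000, 0x0F00)
```

Once the branch has a BHT entry, its recorded history overrides the static guess, so the backward branch above is predicted not taken after it falls through once.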

The earlier PIII’s branch predictor had a success rate of around 91 percent, and the Pentium 4 allegedly uses an even more advanced algorithm to predict branches, so it should perform even better. The Pentium 4 also uses a BTB to store predicted branch targets. Note that in most of Intel’s literature and diagrams, the BTB and BHT are combined under the label the front-end BTB.

The G4e has a BHT size of 2,000 entries, up from 512 entries in the original G4. I don’t have any data on the G4e’s branch prediction success rate, but I’m sure it’s fairly good. The G4e has a 128-entry BTIC, which is twice as large as the original G4’s 64-entry BTIC. The G4e’s BTIC stores the first four instructions in the code stream starting at each branch target, so it goes even further than the original G4 in preventing branch-related pipeline bubbles.

Because of its long pipeline, the Pentium 4 has a minimum misprediction penalty of 20 clock cycles for code that’s in the L1 cache. That’s the minimum, but the damage can be much worse, especially if the correct branch can’t be found in the L1 cache. (In such a scenario, the penalty is upward of 30 cycles.) The G4e’s seven-stage pipeline doesn’t pay nearly as high a price for misprediction as the Pentium 4, but it does take more of a hit than its four-stage predecessor, the G4. The G4e has a minimum misprediction penalty of six clock cycles, as opposed to the G4’s minimum misprediction penalty of only four cycles.
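To get a feel for what these penalties mean, the average cost per instruction can be estimated as branch frequency times misprediction rate times penalty. The sketch below is back-of-the-envelope arithmetic only: the penalties are the minimums quoted above, but the 20 percent branch frequency is a hypothetical workload figure, and the PIII's 91 percent success rate is applied to all three processors purely to isolate the effect of pipeline depth.

```python
def mispredict_cycles_per_instruction(branch_fraction, accuracy, penalty_cycles):
    """Average extra cycles per instruction lost to branch mispredictions."""
    return branch_fraction * (1.0 - accuracy) * penalty_cycles


# Minimum misprediction penalties quoted in the text, in clock cycles.
MIN_PENALTY = {"G4": 4, "G4e": 6, "Pentium 4": 20}

BRANCH_FRACTION = 0.20  # hypothetical: one instruction in five is a branch
ACCURACY = 0.91         # PIII figure from the text, reused for comparison only

costs = {
    cpu: mispredict_cycles_per_instruction(BRANCH_FRACTION, ACCURACY, penalty)
    for cpu, penalty in MIN_PENALTY.items()
}

for cpu, cost in costs.items():
    print(f"{cpu}: about {cost:.2f} extra cycles per instruction")
```

Under these assumptions the Pentium 4 loses roughly 0.36 cycles per instruction to mispredictions, against roughly 0.11 for the G4e and 0.07 for the G4, which is why a deeper pipeline needs a more accurate predictor just to break even.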

In conclusion, both the Pentium 4 and the G4e spend more resources than their predecessors on branch prediction, because their deeper pipelines make mispredicted branches a major performance killer.

The Pentium 4 and G4e do actually have one more branch prediction trick up their sleeves that’s worth at least noting, even though I won’t discuss it in any detail. That trick comes in the form of software branch hints, or extra information that a compiler or programmer can attach to conditional branch instructions. This information gives the branch predictor clues as to the expected behavior of the branch, whether the compiler or programmer expects it to be taken or not taken. There doesn’t seem to be much information available on how much these hints help, and Intel at least recommends that they be used sparingly since they can increase code size.

An Overview of the Pentium 4’s Architecture

Even though the Pentium 4’s pipeline is much longer than that of the G4e, it still performs most of the same functions. Figure 7-5 illustrates the Pentium 4’s basic architecture so that you can compare it to the picture of the G4e presented in Figure 7-4. Due to space and complexity constraints, I haven’t attempted to show each pipeline stage individually like I did with the G4e. Rather, I’ve grouped the related ones together so you can get a more general feel for the Pentium 4’s layout and instruction flow.

Figure 7-5: Basic architecture of the Pentium 4

[Diagram not reproduced. Labeled blocks, front end: Instruction Fetch; Translate x86/Decode; Branch Unit (BU); L1 Instruction Cache (Trace Cache); Trace Cache Fetch (TC). Back end: uop Queue; Reorder Buffer (ROB); Integer & General FP Queue; Memory Queue; Fast Integer, Slow Int. & General FP, Simple FP, and Memory Schedulers; Port 0, Port 1, Load Port, Store Port; Integer Units (SIU1, SIU2, CIU); FPU & Vector ALU (FPU, SIMD); Load-Store Unit (LOAD, STORE); Completion Unit.]


The first thing to notice about Figure 7-5 is that the L1 instruction cache is actually sitting after the fetch and decode stages in the Pentium 4’s front end. This oddly located instruction cache, called the trace cache, is one of the Pentium 4’s most innovative and important features. It also greatly affects the Pentium 4’s pipeline and basic instruction flow, so you have to understand it before we can talk about the Pentium 4’s pipeline in detail.

Expanding the Instruction Window

Chapter 5 talked about the buffering effect of deeper pipelining on the P6 and how it allows the processor to smooth out gaps and hiccups in the code stream. The analogy I used was that of a reservoir, which can smooth out interruptions in the flow of water from a central source.

One of the innovations that makes this reservoir approach effective is the decoupling of the back end from the front end by means of the reservation
