
When the individual IOPs in a group reach their proper issue queues, they can then be issued out of program order to the execution units at a rate of eight IOPs/cycle for all the queues combined. Before they reach the completion stage, however, they need to be placed back into their group so that an entire group of five IOPs can be completed on each cycle. (Don’t worry if this sounds a bit confusing right now. The group dispatch and formation scheme will become clearer when we discuss the 970’s peculiar issue queue structure.)

The price the 970 pays for the reduced bookkeeping overhead afforded it by the dispatch grouping scheme is a loss of execution efficiency, brought on by giving up the fine-grained control that comes from being able to dispatch, schedule, issue, and complete instructions on an individual basis. Let me explain.

The 970’s Dispatch Rules

When the 970’s front end assembles an IOP group, there are certain rules it must follow. The first rule is that the group’s five slots must be populated with IOPs in program order, starting with the oldest IOP in slot 0 and moving up to the newest IOP in slot 4. Another rule is that all branch instructions must go in slot 4, and slot 4 is reserved for branch instructions only. This means that if the front end can’t find a branch instruction to put in slot 4, it can dispatch one less instruction that cycle.

Similarly, there are some situations in which the front end must insert noops into the group’s slots in order to force a branch instruction into slot 4. Noop (pronounced “no op”) is short for no operation; it is a kind of non-instruction instruction that means “do nothing.” In other words, the front end must sometimes insert empty execution slots, or pipeline bubbles, into the instruction stream in order to make the groups comply with the rules.

The preceding rules aren’t the only ones that must be adhered to when building groups. Another rule dictates that instructions destined for the condition register unit (CRU) can go only in slots 0 and 1.

And then there are the rules dealing with cracked and millicoded instructions. Consider the following from IBM’s POWER4 white paper:

Cracked instructions flow into groups as any other instructions with one restriction. Both IOPs must be in the same group. If both IOPs cannot fit into the current group, the group is terminated and a new group is initiated. The instruction following the cracked instruction may be in the same group as the cracked instruction, assuming there is room in the group. Millicoded instructions always start a new group. The instruction following the millicoded instruction also initiates a new group.
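To make these slot rules concrete, here is a small, purely illustrative Python sketch that forms one five-slot group from a stream of IOPs. It models only the rules described above; the IOP representation, the noop-padding strategy, and the function name are simplifications of my own, not IBM’s actual group-formation logic.

```python
NOOP = {"kind": "noop"}

def form_group(iops):
    """Take IOPs in program order; return (one 5-slot group, leftover IOPs)."""
    group = [None] * 5
    i = 0       # index into the incoming IOP stream
    slot = 0    # next group slot to fill; the oldest IOP goes in slot 0
    while i < len(iops) and slot < 5:
        iop = iops[i]
        if iop["kind"] == "branch":
            while slot < 4:              # branches may only occupy slot 4,
                group[slot] = NOOP       # so pad the earlier slots with noops
                slot += 1
            group[4] = iop
            i += 1
            break
        if slot == 4:
            break                        # slot 4 is reserved for branches only
        if iop["kind"] == "cru" and slot > 1:
            break                        # CRU IOPs may only go in slots 0 and 1
        if iop["kind"] == "millicoded":
            if slot > 0:
                break                    # millicoded instructions start a new group...
            group[0] = iop
            i += 1
            break                        # ...and so does the instruction after them
        if iop.get("cracked_first") and slot == 3:
            break                        # both halves of a cracked IOP must share a group
        group[slot] = iop
        slot += 1
        i += 1
    # Any slot left unfilled becomes a noop, i.e., a pipeline bubble.
    group = [g if g is not None else NOOP for g in group]
    return group, iops[i:]

# Example: a branch after two adds forces noops into slots 2 and 3.
stream = [{"kind": "add"}, {"kind": "add"}, {"kind": "branch"}, {"kind": "add"}]
grp, rest = form_group(stream)
print([g["kind"] for g in grp])   # ['add', 'add', 'noop', 'noop', 'branch']
```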

And that’s not all! A group has to have the following resources available before it can even dispatch to the back end. If just one of the following resources is too tied up to accommodate the group or any of its instructions, then the entire group has to wait until that resource is freed up before it can dispatch:

Group completion table (GCT) entry
The group completion table is the 970’s equivalent of a reorder buffer or completion queue. While a normal ROB keeps track of individual in-flight instructions, the GCT tracks whole dispatch groups. The GCT has 20 entries for keeping track of 20 active groups as the groups’ constituent instructions make their way through the ~100 execution slots available in the back end’s pipelines. Regardless of how few instructions are actually in the back end at a given moment, if those instructions are grouped so that all 20 GCT entries happen to be full, no new groups can be dispatched.

Issue queue slot
If there aren’t enough slots available in the appropriate issue queues to accommodate all of a group’s instructions, the group must wait to dispatch. (In a moment I’ll elaborate on what I mean by “appropriate issue queues.”)

Rename registers
There must be enough register rename resources available so that any instruction that requires register renaming can issue when it’s dispatched to its issue queue.

Again, when it comes to the preceding restrictions, one bad instruction can keep the whole group from dispatching.
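The sketch below restates that all-or-nothing check in Python. The resource counts, queue names, and the shape of a group here are placeholders of my own; the point is simply that one IOP whose issue queue or rename pool is exhausted blocks the entire group.

```python
from dataclasses import dataclass

@dataclass
class BackEndResources:
    free_gct_entries: int      # free group completion table entries (out of 20)
    free_issue_slots: dict     # free slots per issue queue, e.g. {"FXU": 4, "FPU": 0}
    free_renames: dict         # free rename registers per type, e.g. {"GPR": 12}

def can_dispatch(group, res):
    """A group dispatches only if every resource it needs is available."""
    if res.free_gct_entries < 1:
        return False                              # the group needs one GCT entry
    needed_slots, needed_renames = {}, {}
    for iop in group:                             # group: list of dicts describing IOPs
        needed_slots[iop["queue"]] = needed_slots.get(iop["queue"], 0) + 1
        for rtype, n in iop.get("renames", {}).items():
            needed_renames[rtype] = needed_renames.get(rtype, 0) + n
    if any(res.free_issue_slots.get(q, 0) < n for q, n in needed_slots.items()):
        return False                              # some issue queue is too full
    if any(res.free_renames.get(t, 0) < n for t, n in needed_renames.items()):
        return False                              # not enough rename registers
    return True

# One IOP that needs a slot in the full FPU queue blocks the whole group:
res = BackEndResources(20, {"FXU": 4, "FPU": 0}, {"GPR": 48})
group = [{"queue": "FXU", "renames": {"GPR": 1}}, {"queue": "FPU"}]
print(can_dispatch(group, res))   # False
```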

Because of its use of groups, the 970’s dispatch bandwidth is sensitive to a complex host of factors, not the least of which is a sort of “internal fragmentation” of the group completion table that could potentially arise and needlessly choke dispatch bandwidth if too many of the groups in the GCT are partially or mostly empty.
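A quick back-of-the-envelope calculation shows what that fragmentation can cost. Only the 20-entry GCT and the five-slot groups come from the text; the two-IOPs-per-group average is an invented number for illustration.

```python
gct_entries = 20                        # groups the 970 can track at once
slots_per_group = 5                     # IOP slots per dispatch group

dense = gct_entries * slots_per_group   # 100 IOPs tracked if every group is full
sparse = gct_entries * 2                # only 40 IOPs if groups average 2 real IOPs
print(dense, sparse)                    # 100 40
```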

In order to keep dispatch bottlenecks from stopping the fetch/decode portion of the pipeline, the 970 can buffer up to four dispatch groups in a four-entry dispatch queue. So if the preceding requirements are not met and there is space in the dispatch queue, a dispatch group can move into the queue and wait there for its dispatch requirements to be met.
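Here is a minimal sketch of that buffering behavior. The in-order draining assumption and the placeholder resources_ok() check are mine; the text specifies only the four-entry depth.

```python
DISPATCH_QUEUE_DEPTH = 4
dispatch_queue = []              # groups waiting for their dispatch requirements

def handle_new_group(group, resources_ok):
    """Dispatch a newly formed group, buffer it, or stall the front end."""
    if not dispatch_queue and resources_ok(group):
        return "dispatched"                       # straight to the back end
    if len(dispatch_queue) < DISPATCH_QUEUE_DEPTH:
        dispatch_queue.append(group)              # wait here for resources to free up
        return "queued"
    return "front end stalls"                     # queue full: fetch/decode backs up

# With back-end resources unavailable, four groups queue up and a fifth stalls the front end.
print([handle_new_group(g, lambda _: False) for g in range(5)])
```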

Predecoding and Group Dispatch

The 970 uses a trick called predecoding in order to move some of the work of group formation higher up in the pipeline, thereby simplifying and speeding up the later decode and group formation phases in the front end. As instructions are fetched from the L2 cache into the L1 I-cache, each instruction is predecoded and marked with a set of five predecode bits. These bits indicate how the instruction should be grouped: whether it should be first in its group, last in its group, or unrestricted; whether it will be a microcoded instruction; whether it will trigger an exception; whether it will be split or not; and so on. This information is used by the decode and group formation hardware to quickly route instructions for decoding and to group them for dispatch.

The predecode hardware also identifies branches and marks them by type, conditional or unconditional. This information is used by the 970’s branch prediction hardware to implement branch folding, fall-through, and branch prediction with minimal latencies.
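The sketch below captures the flavor of predecoding: compute a handful of hint flags once, at I-cache fill time, so the decode and group-formation logic doesn’t have to rediscover them on every fetch. The field names and their encoding are invented for illustration and do not correspond bit-for-bit to the 970’s five predecode bits.

```python
def predecode(insn):
    """Attach predecode hints to one instruction as it enters the L1 I-cache (toy model)."""
    kind = insn["kind"]
    return {
        **insn,
        "starts_group": kind == "millicoded",           # must begin its own group
        "ends_group":   kind in ("branch", "millicoded"),
        "is_cracked":   insn.get("cracked", False),     # will split into two IOPs
        "is_branch":    kind == "branch",
        "branch_type":  insn.get("branch_type"),        # "conditional" or "unconditional"
    }

fetched = [{"kind": "add"}, {"kind": "branch", "branch_type": "conditional"}]
l1_line = [predecode(i) for i in fetched]               # done once per L2-to-L1 fill
print(l1_line[1]["is_branch"], l1_line[1]["branch_type"])   # True conditional
```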

Some Preliminary Conclusions on the 970’s Group Dispatch Scheme

In the preceding section, I went into some detail on the ins and outs of group formation and group dispatching in the 970. If you only breezed through the section and thought, “All of that seems like kind of a pain,” then you got 90 percent of the point I wanted to make. Yes, it is indeed a pain, and that pain is the price the 970 pays for having both width and depth at the same time. The 970’s big trade-off is that it needs less logic to support its long pipeline and extremely wide back end, but in return, it has to give up a measure of granularity, flexibility, and control over the dispatch and issuing of its instructions. Depending on the makeup of the instruction stream and how the IOPs end up being arranged, the 970 could possibly end up with quite a few groups that are either mostly empty, partially empty, or stalled waiting for execution resources.

So while the 970 may be theoretically able to accommodate 200 instructions in varying stages of fetch, decode, execution, and completion, the reality is probably that, under most circumstances, a decent number of its execution slots will be empty on any given cycle due to dispatch, scheduling, and completion limitations. The 970 makes up for this with the fact that it just has so many available slots that it can afford to waste some on group-related pipeline bubbles.

The PowerPC 970’s Back End

The PowerPC 970 sports a total of 12 execution units, depending on how you count them. Even a more conservative count that lumps together the three SIMD integer and floating-point units and doesn’t count the branch execution unit would still give nine execution units.

In the following three sections, I’ll discuss each of the 970’s execution units, comparing them to the analogous units on the G4e and, in some cases, the Pentium 4. As the discussion develops, keep in mind that a simple comparison of the types and numbers of execution units for each of the three processors is not at all adequate to the task of sorting out the real differences between the processors. Rather, there are complicating factors that make comparisons much more difficult than one might naïvely expect. Some of these factors will be evident in the sections dealing with each type of execution unit, but others won’t turn up until we discuss the 970’s issue queues in the last third of the chapter.

NOTE
As I cover each part of the 970’s back end, I’ll specify the number of rename registers of each type (integer, floating-point, vector) that the 970 has. If you compare these numbers to the equivalent numbers for the G4e, you’ll see that the 970 has many more rename registers than its predecessor. This increased number of rename registers is necessary because the 970’s instruction window (up to 200 instructions in flight) is significantly larger than that of the G4e (up to 16 instructions in flight). The more instructions a processor can hold in flight at once, the more rename registers it needs in order to pull off the kinds of tricks that a large instruction window enables: dynamic scheduling, loop unrolling, speculative execution, and the like. In a nutshell, more instructions on the chip in various stages of execution means more data needs to be stored in more registers.
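To see why a deeper instruction window demands more rename registers, here is a generic register-renaming sketch: the textbook mechanism of a map from architected to physical registers plus a free list, not the 970’s actual circuitry. The 32 architected + 48 rename GPR split used as the default is the figure given later in this chapter.

```python
class RenameTable:
    """Textbook renaming: architected registers map onto a larger physical register file."""
    def __init__(self, architected=32, physical=80):    # 32 + 48 rename, as on the 970's GPRs
        self.mapping = {f"r{i}": i for i in range(architected)}
        self.free = list(range(architected, physical))  # the rename (spare) registers

    def rename_dest(self, arch_reg):
        """Give an in-flight instruction's destination its own physical register."""
        if not self.free:
            raise RuntimeError("no rename registers free: dispatch must stall")
        phys = self.free.pop()
        self.mapping[arch_reg] = phys
        return phys

rt = RenameTable()
for _ in range(48):
    rt.rename_dest("r3")       # 48 uncompleted writes each get their own register
print(len(rt.free))            # 0 -- a 49th in-flight write would have to wait
```

With only 16 instructions in flight, a G4e-sized window can get by with a much shallower free list; with up to 200 in flight, the 970 cannot.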


Integer Unit, Condition Register Unit, and Branch Unit

In the chapters on the Pentium 4 and G4e, I described how both of these processors embody a similar approach to integer computation in that they divide integer operations into two types: simple and complex. Simple integer instructions, like add, are the most common type of integer instruction and take only one cycle to execute on most hardware. Complex integer instructions (e.g., integer division) are rarer and take multiple cycles to execute.

In keeping with the quantitative approach to computer design’s central dictum, “Make the common case fast,”1 both the Pentium 4 and G4e split up their integer hardware into two specialized types of execution units: a group of units that handle only simple, single-cycle instructions and a single unit that handles complex, multi-cycle instructions. By dedicating the majority of their integer hardware solely to the rapid execution of the most common types of instructions (the simple, single-cycle ones), the Pentium 4 and the G4e are able to get increased integer performance out of a smaller amount of overall hardware.

Think of the multiple fast IUs (or SIUs) as express checkout lanes for one-item shoppers and the single slow IU as a general-purpose checkout lane for multiple-item shoppers in a supermarket where most of the shoppers only buy a single item. This kind of specialization keeps that one guy who’s stocking up for Armageddon from slowing down the majority of shoppers who just want to duck in and grab eggs or milk on the way home from work.

The PPC 970 differs from both of these designs in that it has two general-purpose IUs that execute almost all integer instructions. To return to the supermarket analogy, the 970 has two general-purpose checkout lanes in a supermarket where most of the shoppers are one-item shoppers. The 970’s two IUs are attached to 80 64-bit GPRs (32 architected and 48 rename).

Why doesn’t the 970 have more specialized hardware (express checkout lanes) like the G4e and Pentium 4? The answer is complicated, and I’ll take an initial stab at answering it in a moment, but first I should clear something up.

The Integer Units Are Not Fully Symmetric

I said that the 970’s two IUs execute “almost all” integer instructions because the units are not, in fact, fully symmetric. One of the IUs performs fixed-point divides, and the other handles special-purpose register (SPR) operations. So the 970’s IUs are slightly specialized, but not in the same manner as the IUs of the G4e and Pentium 4. If the G4e and Pentium 4 have express checkout lanes, the 970 has something more akin to a rule that says, “All shoppers who bought something at the deli must go through line 1, and all shoppers who bought something at the bakery must go through line 2; everyone else is free to go through either line.”
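In code, the deli/bakery rule might look something like the toy router below. The text doesn’t say which physical unit gets divides and which gets SPR operations, so the IU1/IU2 labels and the shorter-queue tiebreak for unrestricted instructions are placeholders of my own.

```python
def pick_integer_unit(iop_kind, queue_depth):
    """Route one integer IOP to 'IU1' or 'IU2' (illustrative rule only)."""
    if iop_kind == "divide":
        return "IU1"                                   # only one unit does fixed-point divides
    if iop_kind == "spr":
        return "IU2"                                   # only the other handles SPR operations
    return min(queue_depth, key=queue_depth.get)       # everything else: either line

print(pick_integer_unit("divide", {"IU1": 3, "IU2": 0}))   # 'IU1', even though it is busier
print(pick_integer_unit("add",    {"IU1": 3, "IU2": 0}))   # 'IU2', the shorter line
```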

Thankfully, integer divides and SPR instructions are relatively rare, so the impact on performance of this type of forced segregation is minimized. In fact, if the 970 didn’t have the group formation scheme, this seemingly minor degree of specialization might hardly be worth commenting on. But as it stands, group-related scheduling issues turn this specialization into a potentially negative factor, albeit a minor one, for integer performance, for reasons that we’ll discuss later on in this chapter.

1 See John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition (Morgan Kaufmann Publishers, 2003).

Integer Unit Latencies and Throughput
