Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
We’ll talk more about the concept of the instruction window and about the
structures that make it up (the ROB and the reservation stations) in the next
section on the 604. For now, it suffices to say that the 603’s instruction window is quite small compared to that of its successors—three of its four reservation
stations are only single-entry, and one is double-entry (the one attached to
the load-store unit). Because the 603’s instruction window is so small, it needs relatively few rename registers to temporarily hold execution results prior to
commitment. The 603 has five general-purpose rename registers, four
floating-point rename registers, and one rename register each for the
condition register (CR), link register (LR), and count register (CTR).
The 603 and 603e follow the 601 in their ability to do speculative execution
by means of a simple, static branch predictor. Like the static predictor
on the 601, the 603e’s predictor marks forward branches as not taken and
backward branches as taken. This static branch predictor is simple and fast,
but it is only mildly effective compared to even a weakly designed dynamic
branch predictor. If PPC users in the 603e/604 era wanted dynamic branch
prediction, they had to upgrade to the 604.
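The static rule described above is simple enough to sketch in a few lines. This is only an illustration, not the 603e's actual logic: the function names and the addresses are hypothetical, and it assumes that "backward" means the branch target sits at a lower address than the branch itself.

```python
def predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Static rule from the text: backward branch -> taken, forward -> not taken."""
    # Assumption: "backward" means the target address is lower than the branch's own address.
    return target_pc < branch_pc


def count_correct(branch_pc: int, target_pc: int, outcomes: list[bool]) -> int:
    """Count how often the fixed prediction matches the actual outcomes."""
    prediction = predict_taken(branch_pc, target_pc)
    return sum(1 for taken in outcomes if taken == prediction)


# Hypothetical loop-closing branch at 0x120 that jumps back to 0x100:
# taken nine times, then falls through once when the loop exits.
loop_outcomes = [True] * 9 + [False]
loop_hits = count_correct(0x120, 0x100, loop_outcomes)

# Hypothetical forward branch that happens to be taken half the time.
forward_outcomes = [True, False] * 5
forward_hits = count_correct(0x120, 0x140, forward_outcomes)
```

The prediction never changes at runtime, so the rule does well on loop-closing branches but can't adapt to a forward branch that turns out to be taken often.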
Summary: The 603 and 603e in Historical Context
With its stellar performance-per-watt ratio, the 603 was a great little processor, and it would have made a good low- to midrange desktop processor as well if
it weren’t for Apple’s legacy 68K code base. The 603e’s tweaks and larger cache
size helped with the legacy problems somewhat, but the updated chip still played second fiddle in Apple’s product line to the larger, much more powerful 604.
You haven’t seen the last of the 603e, though. The 603e’s design formed
the basis for what would eventually become Motorola’s PowerPC 7400 (aka the
G4), which we’ll cover in “The PowerPC 7400 (aka the G4)” on page 133.
122
Chapter 6
The PowerPC 604
At the same time the 603 was making its way toward the market, the 604 was
in the works as well. The 604 was to be Apple’s high-end PPC desktop processor,
so its power and transistor budgets were much higher than those of the
603. Table 6-3 summarizes the 604’s features, and a quick glance at a diagram
of the 604 (see Figure 6-3) shows some obvious ways that it differs from its
lower-end sibling. For example, in the front end, the length of the instruction
queue has been increased by two entries. In the back end, two more integer
units have been added, and the CR logical unit has been removed. These
changes reflect some important differences in the overall approach of the
604, differences that will be examined in greater detail shortly.
Table 6-3: Features of the PowerPC 604 and 604e

                             PowerPC 604         PowerPC 604e
Introduction Date            May 1, 1995         July 19, 1996
Process                      0.50 micron         0.35 micron
Transistor Count             3.6 million         5.1 million
Die Size                     197 mm²             148 mm²
Clock Speed at Introduction  120 MHz             180–200 MHz
L1 Cache Size                32KB split L1       64KB split L1
First Appeared In            PowerMac 9500/120   Power Computing PowerTower Pro 200
                                                 (PowerMac 9500/180 on August 7, 1996)
The 604’s Pipeline and Back End
The 604’s pipeline is deeper than that of the 601 and the 603, and it consists
of the following six stages:
Four Phases of the Standard RISC Pipeline   Six Stages of the 604’s Pipeline
Fetch                                       1. Fetch
Decode/dispatch                             2. Decode
                                            3. Dispatch (ROB and rename)
Execute                                     4. Execute
Write-back                                  5. Complete
                                            6. Write-back
In the 604, the standard RISC decode/dispatch phase is split into two
stages, as is the write-back phase. I’ll explain just how these two new pipeline stages work in the section on the instruction window, but for now all you need
to understand is that this lengthened pipeline enables the 604 to reach higher
clock speeds than its predecessors. Because each pipeline stage is simpler, it
takes less time to complete, which means that the CPU’s clock cycle time can
be shortened.
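To put rough numbers on that last point, here is a small back-of-the-envelope sketch. The 24 ns figure is purely hypothetical (it is not a measurement of any PowerPC part), and the model assumes the work divides evenly across stages, which real designs only approximate.

```python
def cycle_time_ns(total_logic_ns: float, stages: int, latch_overhead_ns: float = 0.0) -> float:
    """Clock period if the total logic delay is split evenly across the stages."""
    # Assumption: perfectly balanced stages; latch_overhead_ns models per-stage register delay.
    return total_logic_ns / stages + latch_overhead_ns


def max_clock_mhz(cycle_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / cycle_ns


# Hypothetical: 24 ns of logic to process one instruction from fetch to write-back.
four_stage_mhz = max_clock_mhz(cycle_time_ns(24.0, 4))  # 6 ns per stage
six_stage_mhz = max_clock_mhz(cycle_time_ns(24.0, 6))   # 4 ns per stage
```

In this idealized model, splitting the same work into six stages instead of four shortens the cycle from 6 ns to 4 ns. Any nonzero latch overhead eats into that gain.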
PowerPC Processors: 600 Series, 700 Series, and 7400
123
[Diagram labels: Front End (Instruction Fetch, Instruction Queue, Branch Unit with CR, Decode/Dispatch); reservation stations feeding the Load-Store Unit (memory access), the Floating-Point Unit, and the Integer ALUs (scalar arithmetic logic units); Back End (16-entry Reorder Buffer, Write/Commit Unit).]
Figure 6-3: PowerPC 604 microarchitecture
Aside from the longer pipeline, another factor that really sets the 604
apart from the other 600-series PPC designs discussed so far is its wider back
end. The 604 can execute up to six instructions per clock cycle in the following six execution units:
• Branch unit (BU)/condition register unit (CRU)
• Load-store unit (LSU)
• Floating-point unit (FPU)
• Three integer units (IU)
    • Two simple integer units (SIUs)
    • One complex integer unit (CIU)
Unlike the other 600-series processors, the 604 has multiple integer units.
This division of labor, where multiple fast integer units execute simple integer instructions and one slower integer unit executes complex integer instructions,
will be discussed in more detail in Chapter 8. Any integer instruction that takes only a single cycle to execute can pass through one of the two SIUs. On the
other hand, integer instructions that take multiple cycles to execute, like
integer divides, have to pass through the slower CIU.
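The routing rule in the preceding paragraph boils down to a single test on latency. The sketch below is a toy model of that rule; the enum, the function name, and the 20-cycle divide latency are illustrative assumptions, not figures from the text.

```python
from enum import Enum


class IntUnit(Enum):
    SIU = "simple integer unit"
    CIU = "complex integer unit"


def route_integer_op(latency_cycles: int) -> IntUnit:
    """Single-cycle integer ops go to an SIU; multi-cycle ops go to the CIU."""
    if latency_cycles < 1:
        raise ValueError("latency must be at least one cycle")
    return IntUnit.SIU if latency_cycles == 1 else IntUnit.CIU


# Illustrative latencies only (hypothetical values for the sake of the example).
add_unit = route_integer_op(1)      # a single-cycle add
divide_unit = route_integer_op(20)  # a multi-cycle divide
```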
Like the 603e, the 604 has register renaming, a technique that is facilitated by the 12-entry register rename file attached to the 32-entry general-purpose
register file. These rename buffers allow the 604’s execution units more options for avoiding false dependencies and register-related stalls.
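To see how rename buffers remove false dependencies, consider the following minimal sketch. It is a generic renaming model, not a description of the 604's actual circuitry: the class, its method, and the register names are hypothetical, and only the 12-entry size comes from the text.

```python
class RenameFile:
    """Toy rename file: each new result gets its own rename buffer."""

    def __init__(self, num_rename: int = 12):
        self.free = list(range(num_rename))   # unallocated rename buffers
        self.newest: dict[str, int] = {}      # architectural register -> newest rename buffer

    def rename(self, dest: str, srcs: list[str]):
        """Return (dest_tag, src_tags) for one instruction, in program order."""
        # Sources read the newest in-flight copy if there is one, else the architectural register.
        src_tags = [("rename", self.newest[s]) if s in self.newest else ("arch", s)
                    for s in srcs]
        if not self.free:
            raise RuntimeError("no free rename buffer (dispatch would have to stall)")
        tag = self.free.pop(0)
        self.newest[dest] = tag
        return ("rename", tag), src_tags


rf = RenameFile()
d1, _ = rf.rename("r3", ["r1", "r2"])   # i1: r3 = r1 + r2
_, s2 = rf.rename("r4", ["r3", "r5"])   # i2: r4 = r3 + r5 (true dependency on i1)
d3, _ = rf.rename("r3", ["r6", "r7"])   # i3: r3 = r6 + r7 (reuses the name r3)
```

Because i1 and i3 write to different rename buffers, i3 doesn't have to wait for i2 to finish reading the old value of r3. Only the true dependency between i1 and i2 remains.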
The 604’s floating-point unit does most single- and double-precision operations
with a three-cycle latency, just like the 603e. Unlike the 603e, though,
the 604’s floating-point unit is fully pipelined for double-precision multiplies.
Floating-point division and two other instructions take from 18 to 33 cycles
on the 604, as on the 603e. Finally, the 604’s 32-entry floating-point register
file is attached to an 8-entry floating-point rename register buffer.
The 604’s load-store unit (LSU) is also similar to that of the 603e. Like the
603e’s LSU, it contains an adder for doing address calculations and handles
all load-store traffic, but unlike the 603e, it’s connected to deeper load and
store queues and allows a little more flexibility for the optimal reordering of
memory operations.
The 604’s branch unit also features a dynamic branch prediction scheme
that’s a vast improvement over the 603e’s static branch predictor. The 604
has a large, 512-entry branch history table (BHT) with two bits per entry for
tracking branches, coupled with a 64-entry branch target address cache (BTAC), which is the equivalent of the Pentium’s BTB.
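A table of two-bit entries like this is commonly run as a set of saturating counters, and the sketch below assumes that scheme. The text gives only the table size and the bits per entry, so the counter policy, the initial state, and the way the branch address is mapped to an entry are all assumptions made for illustration.

```python
class BranchHistoryTable:
    """Toy 512-entry table of two-bit saturating counters (0..3)."""

    def __init__(self, entries: int = 512):
        self.entries = entries
        # Assumption: every counter starts at 1 ("weakly not taken").
        self.counters = [1] * entries

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, indexed by the low-order word-address bits.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is in one of its two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht: BranchHistoryTable, pc: int, outcomes: list[bool]) -> int:
    """Predict, compare, and train on each outcome in turn."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


# Hypothetical loop branch: taken nine times, then not taken, and the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(BranchHistoryTable(), 0x1000, outcomes)
```

Unlike a static predictor, the counter learns each branch's behavior. After the first pass through the loop, the single not-taken outcome at loop exit is not enough to flip the next prediction.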
As always, the more transistors you spend on branch prediction, the
better performance is, so the 604’s more advanced branch unit helps it quite
a bit. Still, in the case of a misprediction, the 604’s longer pipeline has to pay a higher price than its shorter-pipelined predecessors in terms of performance.
Of course, the bigger performance loss associated with a misprediction is
also the reason the 604 needs to spend those extra resources on branch
prediction.
Notice that the list of execution units on page 124 is missing a unit that is
present on the 603e: the system unit. The 603e’s system unit handled updates
to the PPC condition register, a function that was handled by the integer
execution unit on the older 601. The 604 moves the responsibility of dealing with
the condition register onto the branch unit. So the 604’s branch unit contains
a separate execution unit that handles all logical operations that involve the
PowerPC condition register. This condition register unit (CRU) shares a
dispatch bus and some other resources with the branch execution unit, so
it’s not a fully independent execution unit like the 603e’s system unit. What
does this BU/CRU combination do for performance? It probably doesn’t have
a huge impact, but whatever impact it does have is significant enough that
the 604’s immediate successor, the 604e, adds an independent execution
unit to the back end for CR logical operations.
The 604’s Front End and Instruction Window
The 604’s front end and instruction window look like a combination of the
best features of the 601 and the 603e. Like the 601, the 604’s instruction
queue is eight entries deep. Instructions are fetched from the L1 cache into
the instruction queue, where they’re decoded before being dispatched to the
back end. Branches that can be folded are folded, and the 604’s dispatch logic
can dispatch up to four instructions per cycle (up from two on the 603e and
three on the 601) from the bottom four entries of the instruction queue to
the back end’s execution units.
During the 604’s dispatch stage, rename registers and a reorder buffer
entry are assigned to each dispatching instruction. When the instruction is
ready to dispatch, it’s sent either directly to an execution unit or to an execution unit’s reservation station, depending on whether or not its operands are
available at the time of dispatch. Note that the 604 can dispatch at most one
instruction to each execution unit, and there are certain rules that govern
when the dispatch logic can dispatch an instruction to the back end. We’ll
cover these rules in more detail in a moment, but for now you need to be
aware of one of the rules: An instruction cannot dispatch if the execution
unit that it needs is not available.
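The dispatch constraints mentioned so far can be modeled in a few lines. This is a deliberately simplified sketch: it covers only the four-instruction width, the one-instruction-per-unit limit, and the unit-availability rule. It leaves out rename-register and reorder-buffer allocation, and it assumes that a blocked instruction holds up everything behind it. The instruction names and unit labels are hypothetical.

```python
DISPATCH_WIDTH = 4  # up to four instructions per cycle, per the text


def dispatch_cycle(queue: list[tuple[str, str]], available_units: dict[str, int]) -> list[str]:
    """Return the names of the instructions dispatched this cycle.

    queue: (instruction name, unit type) pairs, oldest first.
    available_units: unit type -> number of units of that type able to accept work.
    """
    free = dict(available_units)  # copy so the caller's dict is untouched
    dispatched = []
    for name, unit_type in queue[:DISPATCH_WIDTH]:
        if free.get(unit_type, 0) == 0:
            break  # assumption: in-order dispatch, so a blocked instruction stalls the rest
        free[unit_type] -= 1  # each execution unit takes at most one instruction per cycle
        dispatched.append(name)
    return dispatched


units = {"SIU": 2, "CIU": 1, "LSU": 1, "FPU": 1, "BU": 1}

mixed = dispatch_cycle(
    [("add", "SIU"), ("lwz", "LSU"), ("sub", "SIU"), ("fadd", "FPU"), ("and", "SIU")],
    units,
)
two_loads = dispatch_cycle([("lwz", "LSU"), ("stw", "LSU"), ("add", "SIU")], units)
```

In the first case a mix of instruction types fills all four dispatch slots. In the second, two back-to-back memory operations compete for the single load-store unit, so only the first one dispatches.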
The Issue Phase: The 604’s Reservation Stations
In Figure 6-3, you probably noticed that each of the 604’s execution units has
a reservation station attached to it; this includes a reservation station each
(not depicted) for the branch execution and condition register units that
make up the branch unit. The 604’s reservation stations are relatively small,
two-entry (the CIU’s reservation station is single-entry), first-in first-out (FIFO) affairs, but they make up the heart of the 604’s instruction window, because
they allow the instructions assigned to one execution unit to issue out of
program order with respect to the instructions that are assigned to the other
execution units.
This works as follows: The dispatch stage sends instructions into the
reservation stations (i.e., the issue phase) in program order, and, with one
important exception (described in the next paragraph), the instructions pass
through their respective reservation stations in order. An instruction enters
the top of a reservation station, and as the instructions ahead of it issue, it moves down the queue, until it eventually exits through the bottom (i.e., it issues).
Therefore, we can say each instruction issues in order with respect to the other instructions in its same reservation station. However, the various reservation
stations can issue instructions at different times, with the result that instructions issue out of order from the perspective of the overall program flow.
The simple integer units function a little differently than described earlier,
because they allow instructions to issue from their two-entry reservation stations out of order with respect to the other instructions in their own execution unit.
So unlike other types of instructions described previously, integer instructions can move through their respective reservation stations and pipelines out of
program order, not just with respect to the overall program flow, but with
respect to the other instructions in their own reservation station.
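The difference between the two issue policies can be sketched as follows. This is a toy model with hypothetical instruction names; "ready" simply stands for the set of instructions whose input operands have arrived.

```python
from collections import deque


def issue_fifo(station: deque, ready: set):
    """In-order station: only the oldest entry may issue, and only if its operands are ready."""
    if station and station[0] in ready:
        return station.popleft()
    return None


def issue_any(station: deque, ready: set):
    """SIU-style station: the oldest ready entry may issue, even past an older stalled one."""
    for instr in station:
        if instr in ready:
            station.remove(instr)
            return instr  # return immediately; the deque is not iterated after mutation
    return None


# Hypothetical state: the older entry in each two-entry station is still waiting on operands.
ready = {"fadd", "add2"}
fpu_station = deque(["fmul", "fadd"])   # oldest first
siu_station = deque(["add1", "add2"])

fpu_issued = issue_fifo(fpu_station, ready)  # fmul blocks fadd
siu_issued = issue_any(siu_station, ready)   # add2 slips past add1
```

In this example the floating-point station issues nothing because its oldest entry is stalled, while the simple integer station issues its younger instruction. Across stations, issue is out of order under either policy, since each station makes its own decision every cycle.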
The reservation stations in the 604 and its architectural successors exist
to keep instructions that lack their input operand data but are otherwise
ready to dispatch from tying up the instruction queue. If an instruction meets
all of the other dispatch requirements (see “The Four Rules of Instruction
Dispatch”), and if its assigned execution unit is available but it just doesn’t yet have access to the part of the data stream that it needs, it dispatches to