Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (64 page)


Authors: jon stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors


intensive office apps.

The final thing worth noting about the Pentium’s two integer ALUs is that they are responsible for many of the processor’s address calculations.

The Intel Pentium and Pentium Pro


More recently designed processors have specialized hardware for handling the address calculations associated with loads and stores, but on the Pentium these calculations are done in the integer ALUs.

The Floating-Point ALU

Floating-point operations are usually more complex to implement than integer operations, so floating-point pipelines often feature more stages than integer pipelines. The Pentium’s six-stage floating-point pipeline is no exception to this rule. The Pentium’s floating-point performance is limited by two main factors. First, the processor can only dispatch both a floating-point and an integer operation simultaneously under extremely restrictive circumstances. This isn’t too bad, though, because floating-point and integer code are rarely mixed. The second factor, the unfortunate design of the x87 floating-point architecture, is more important.

In contrast to the average RISC ISA’s flat floating-point register file, the x87 register file contains eight 80-bit registers arranged in the form of a stack. A stack is a simple data storage structure commonly used by programmers and some scientific calculators to perform arithmetic.

NOTE

Flat is an adjective that programmers use to describe an array of elements that is logically laid out so that any element is accessible via a simple address. For instance, all of the register files that we’ve seen so far are flat, because a programmer needs to know only the name of the register in order to access that register. Contrast the flat file with the stack structure described next, in which elements that are inside the data structure are not immediately and directly accessible to the programmer.

As Figure 5-4 illustrates, a programmer writes data to the stack by pushing it onto the top of the stack via the push instruction. The stack therefore grows with each new piece of data that is pushed onto its top. To read data from the stack, the programmer issues a pop instruction, which returns the topmost piece of data and removes that data from the stack, causing the stack to shrink.

As the stack grows and shrinks, the variable ST, which stands for the stack top, always points to the top element of the stack. In the most basic type of stack, ST is the only element of the stack that can be directly accessed by the programmer: it is read using the pop command, and it is written to using the push command. This being the case, if you want to read the blue element from the stack in Figure 5-4, you have to pop all of the elements above it, and then you have to pop the blue element itself. Similarly, if you want to alter the blue element, you first have to pop all of the elements above it. Then you pop the blue element itself, alter it, and then push the modified element back onto the stack.

Because the first item that you place in a stack is not accessible until you’ve removed all the items above it, a stack is often called a FILO (first in, last out) data structure. Contrast this with a traditional queue structure, like a supermarket checkout line, which is a FIFO (first in, first out) structure.
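The push, pop, and FILO/FIFO behavior described above can be sketched in a few lines of Python. This is only an illustration: the class name `SimpleStack`, the helper `read_at_depth`, and the color strings are made up for this example and do not come from the text or its figures.

```python
from collections import deque


class SimpleStack:
    """A basic stack: only the top element (ST) is directly accessible."""

    def __init__(self):
        self._items = []

    def push(self, value):
        # The stack grows by one element.
        self._items.append(value)

    def pop(self):
        # Return the topmost element and shrink the stack.
        return self._items.pop()

    def read_at_depth(self, depth):
        """Read the element `depth` positions below the top using only push and pop."""
        above = [self.pop() for _ in range(depth)]  # pop everything above the target
        value = self.pop()                          # pop the target itself
        self.push(value)                            # put the target back
        for item in reversed(above):                # restore the elements above it
            self.push(item)
        return value


stack = SimpleStack()
for color in ("blue", "red", "green"):
    stack.push(color)

# Reading "blue" requires popping the two elements above it first.
buried = stack.read_at_depth(2)

# FILO: the first item pushed is the last one popped.
stack_order = [stack.pop() for _ in range(3)]

# FIFO: a queue hands items back in arrival order.
queue = deque(["blue", "red", "green"])
queue_order = [queue.popleft() for _ in range(3)]
```

Note how much shuffling `read_at_depth` needs just to look at one buried element; that cost is what the indexed ST(i) access described later avoids.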


Chapter 5

Figure 5-4: Pushing and popping data on a simple stack

All of this pushing and popping sounds like a lot of work, and you might wonder why anyone would use such a data structure. As it turns out, a stack is ideal for certain specialized types of applications, like parsing natural language, keeping track of nested procedure calls, and evaluating postfix arithmetic expressions. It was the stack’s utility for evaluating postfix arithmetic expressions that recommended it to the designers of the x87 floating-point unit (FPU), so they arranged the FPU’s eight floating-point registers as a stack.

NOTE

Normal arithmetic expressions, like 5 + 2 – 1 = 6, are called infix expressions, because the arithmetic operators (+ and –) are situated in between the numbers on which they operate. Postfix expressions, in contrast, have the operators affixed to the end of the expression, e.g., 5 2 1 – + = 6. You could evaluate this expression from left to right using a stack by pushing the numbers 5, 2, and 1 onto the stack (in that order), and then popping them back off (first 1, then 2, and finally 5) as the operators at the end of the expression are encountered. The operators would be applied to the popped numbers as they appear, and the running result would be stored in the top of the stack.
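The evaluation procedure in this note can be sketched as a small stack-based evaluator. It is a minimal illustration, not anything taken from the x87 itself: the function name `eval_postfix` and the space-separated token format are assumptions of this example, and the ASCII hyphen stands in for the minus sign.

```python
import operator

# Binary operators supported by this sketch.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_postfix(tokens):
    """Evaluate a postfix expression given as a list of string tokens."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()   # most recently pushed operand
            left = stack.pop()    # operand pushed before it
            stack.append(OPS[tok](left, right))  # running result goes back on top
        else:
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


# 5 2 1 - +  means  5 + (2 - 1)
result = eval_postfix("5 2 1 - +".split())
```

Each operator consumes the two topmost values and leaves its result on top of the stack, which is the property that made a stack attractive for floating-point arithmetic.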

The x87 register file is a little different than the simple stack described two paragraphs ago, because ST is not the only variable through which the stack elements can be accessed. Instead, the programmer can read and write the lower elements of the stack by using ST with an index value that designates the desired element’s position relative to the top of the stack. For example, in Figure 5-5, the stack is at its tallest when the green value has just been pushed onto it. This green value is accessed via the variable ST(0), because it occupies the top of the stack. The blue value, because it is three elements down from the top of the stack, is accessed via ST(3).


Figure 5-5: Pushing and popping data on the x87 floating-point register stack

In general, to read from or write to a specific register in the stack, you can just use the form ST(i), where i is the number of registers from the top of the stack.

Programming purists might suggest that since you can access its stack elements arbitrarily, it’s kind of pointless to still call the x87 register file a stack. This would be true except for one catch: For every floating-point arithmetic instruction, at least one of the operands must be the stack top. For instance, if you want to add two floating-point numbers, one of the numbers must be in the stack top and the other can be in any of the other registers. For example, the instruction

fadd ST, ST(5)

performs the operation

ST = ST + ST(5)
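To make the ST(i) addressing and the stack-top restriction concrete, here is a toy Python model of the register stack. It is a sketch under loud assumptions: the class `X87Stack` and its method names are hypothetical, and the model tracks only register values, ignoring 80-bit precision, tag bits, and exceptions.

```python
class X87Stack:
    """Toy model of the eight-entry x87 register stack (values only)."""

    SIZE = 8

    def __init__(self):
        self._regs = []  # the last list item plays the role of ST(0)

    def push(self, value):
        if len(self._regs) == self.SIZE:
            raise OverflowError("the register stack holds only eight values")
        self._regs.append(float(value))

    def pop(self):
        return self._regs.pop()

    def st(self, i=0):
        """Read ST(i), the register i positions below the stack top."""
        return self._regs[-1 - i]

    def set_st(self, i, value):
        """Write ST(i)."""
        self._regs[-1 - i] = float(value)

    def fadd(self, i):
        """Model of 'fadd ST, ST(i)': ST = ST + ST(i)."""
        self.set_st(0, self.st(0) + self.st(i))


fpu = X87Stack()
for value in (10, 20, 30, 40, 50, 60):
    fpu.push(value)

# ST(0) is 60 and ST(5) is 10 at this point.
fpu.fadd(5)  # ST = ST + ST(5)
```

The model deliberately offers no way to add two arbitrary registers directly: `fadd` always reads and writes the stack top, mirroring the restriction described above.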

Though the stack-based nature of x87’s floating-point register file was originally a boon to assembly language programmers, it soon began to become an obstacle to floating-point performance as compilers saw more widespread use. A flat register file is easier for a compiler to manage, and the newer RISC ISAs featured not only large, flat register files but also three-operand floating-point instructions.

While compiler tricks are arguably enough to make up for x87’s two-operand limit under most circumstances, they’re not quite able to overcome both the two-operand limit and the stack-based limit. So compiler tricks alone won’t eliminate the performance penalties associated with both of these quirks combined. The stack-based register file is bad enough that a microarchitectural hack is needed in order to simulate a flat register file and thereby keep the x87’s design from hobbling floating-point performance.

This microarchitectural hack involves turbocharging a single instruction: fxch. The fxch instruction is an ordinary x87 instruction that allows you to swap any element of the stack with the stack top. For example, if you wanted to calculate ST(2) = ST(2) + ST(6), you might execute the code shown in Program 5-1:

Line #  Code            Comments
1       fxch ST(2)      Place the contents of ST(2) into ST and the contents of ST into ST(2).
2       fadd ST, ST(6)  Add the contents of ST(6) to ST.
3       fxch ST(2)      Place the contents of ST(2) into ST and the contents of ST into ST(2).

Program 5-1: Using the fxch instruction

Now, here’s where the microarchitectural hack comes in. On all modern x86 designs, from the original Pentium up to but not including the Pentium 4, the fxch instruction can be executed in zero cycles. This means that for all intents and purposes, fxch is “free of charge” and can therefore be used when needed without a performance hit. (Note, however, that the fxch instruction still takes up decode bandwidth, so even when it’s “free,” it’s not entirely “free.”) If you stop and think about the fact that, before executing any floating-point instruction (which has to involve the stack top), you can instantaneously swap ST with any other register, you’ll realize that a zero-cycle fxch instruction gives programmers the functional equivalent of a flat register file.

To revisit the previous example, the fact that the first instruction in Program 5-1 executes “instantaneously,” as it were, means that the series of operations effectively looks as follows:

fadd ST(2), ST(6)
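The equivalence claimed here can be checked with a small simulation. The sketch below models the register stack as a plain Python list in which `regs[i]` holds ST(i); the function names and the sample values are assumptions made for illustration, and the model says nothing about timing, only about the final register contents.

```python
def fxch(regs, i):
    """Model of 'fxch ST(i)': swap the stack top ST(0) with ST(i)."""
    regs[0], regs[i] = regs[i], regs[0]


def fadd(regs, i):
    """Model of 'fadd ST, ST(i)': ST = ST + ST(i)."""
    regs[0] = regs[0] + regs[i]


def program_5_1(regs):
    """The three-instruction sequence from Program 5-1."""
    fxch(regs, 2)
    fadd(regs, 6)
    fxch(regs, 2)


# Sample contents: regs[i] is the value held in ST(i).
regs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# What a flat register file would produce for ST(2) = ST(2) + ST(6).
expected = list(regs)
expected[2] = regs[2] + regs[6]

program_5_1(regs)
```

After the sequence runs, `regs` matches `expected`: only ST(2) has changed, and every other register, including the stack top, holds its original value.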

There are in fact some limitations on the use of the “free” fxch instruction, but the overall result is that by using this trick, both the Pentium and its successors get the effective benefits of a flat register file, but with the aforementioned hit to decode bandwidth.

x86 Overhead on the Pentium

There are a number of places, like the Pentium’s decode-2 stage, where legacy x86 support adds significant overhead to the Pentium’s design. Intel has estimated that a whopping 30 percent of the Pentium’s transistors are dedicated solely to providing x86 legacy support. When you consider the fact that the Pentium’s RISC competitors with comparable transistor counts could spend those transistors on performance-enhancing hardware like execution units and cache, it’s no wonder that the Pentium lagged behind some of its contemporaries when it was first introduced.


A large chunk of the Pentium’s legacy-supporting transistors are eaten up by its microcode ROM. Chapter 4 explained that one of the big benefits of RISC processors is that they don’t need the microcode ROMs that CISC designs require for decoding large, complex instructions. (For more on x86 as a CISC ISA, see the section “CISC, RISC, and Instruction Set Translation” on page 103.)

The front end of the Pentium also suffers from x86-related bloat, in that its prefetch logic has to take account of the fact that x86 instructions are not a uniform size and hence can straddle cache lines. The Pentium’s decode logic also has to support x86’s segmented memory model, which means checking for and enforcing code segment limits; such checking requires its own dedicated address calculation hardware, in addition to the Pentium’s other address hardware.
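The two checks mentioned above, cache-line straddling and code segment limits, come down to simple address arithmetic, which the following Python sketch illustrates. Everything here is an assumption made for the example: the 32-byte line size, the function names, and the simplified reading of a segment limit as the last valid byte offset.

```python
LINE_SIZE = 32  # bytes; an assumed cache-line size for this sketch


def straddles_cache_line(start_addr, length, line_size=LINE_SIZE):
    """Return True if an instruction's bytes fall in more than one cache line."""
    first_line = start_addr // line_size
    last_line = (start_addr + length - 1) // line_size
    return first_line != last_line


def within_segment_limit(offset, length, limit):
    """Simplified limit check: the last byte fetched must not exceed the limit."""
    return offset + length - 1 <= limit


# A 4-byte variable-length instruction starting at byte 30 spills into the next line.
spills = straddles_cache_line(0x1E, 4)

# Fixed-size, 4-byte-aligned instructions never straddle a line.
aligned_never_spill = all(
    not straddles_cache_line(addr, 4) for addr in range(0, 128, 4)
)
```

The second result shows why a fixed-width, aligned instruction format spares the prefetch logic this case entirely, whereas variable-length instructions force the hardware to handle it.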

Summary: The Pentium in Historical Context

The primary factor constraining the Pentium’s performance versus its RISC competitors was the fact that its entire front end was bloated with hardware that was there solely to support x86 features which, even at the time of the processor’s introduction, were rapidly falling out of use. With transistor budgets as tight as they were in 1993, each of those extra address adders and prefetch buffers (not to mention the microcode ROM) represented a painful expenditure of scarce resources that did nothing to enhance the Pentium’s performance.

Fortunately for Intel, Pentium’s legacy support headaches weren’t the end of the story. There were a few facts and trends working in the favor of Intel and the x86 ISA. If we momentarily forget about ISA extensions like MMX, SSE, and so on, and the odd handful of special-purpose instructions, like Intel’s CPU identifier instruction, that get added to the x86 ISA every so often, the core legacy x86 ISA is fixed in size and has not grown over the years. Similarly, with one exception (the P6, covered next), the amount of hardware
