Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (61 page)

Authors: Jon Stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

decodes the larger x86 instructions, as do Intel’s Pentium III and Pentium 4.

The key to understanding Figure 4-7 is that the blue layer represents a

layer of abstraction that hides the complexity of the underlying hardware

from the programmer. The blue layer is not a hardware layer (that’s the

gray one) and it’s not a software layer (that’s the peach one), but it’s a

conceptual layer. Think of it like a user interface that hides the complexity

of an operating system from the user. All the user needs to know to use the

machine is how to close windows, launch programs, find files, and so on. The

UI (and by this I mean the WIMP conceptual paradigm—windows, icons,

menus, pointer—not the software that implements the UI) exposes the

machine’s power and functionality to the user in a way that he or she can

understand and use. And whether that UI appears on a PDA or on a desktop

machine, the user still knows how to use it to control the machine.

The main drawback to using microcode to implement an ISA is that

the microcode engine was, in the beginning, slower than direct decoding.

(Modern microcode engines are about 99 percent as fast as direct execution.)

However, the ability to separate ISA design from microarchitectural implementation

was so significant for the development of modern computing that

the small speed hit incurred was well worth it.

The advent of the reduced instruction set computing (RISC) movement in the 1970s saw a couple of changes to the scheme described previously. First and

foremost, RISC was all about throwing stuff overboard in the name of speed.

So the first thing to go was the microcode engine. Microcode had allowed ISA

designers to get elaborate with instruction sets, adding in all sorts of complex and specialized instructions that were intended to make programmers’ lives

easier but that were in reality rarely used. More instructions meant that you

needed more microcode ROM, which in turn meant larger CPU die sizes,

higher power consumption, and so on. Since RISC was more about less, the

microcode engine got the ax. RISC reduced the number of instructions in

the instruction set and reduced the size and complexity of each individual

instruction so that this smaller, faster, and more lightweight instruction set

could be more easily implemented directly in hardware, without a bulky

microcode engine.

While RISC designs went back to the old method of direct execution of

instructions, they kept the concept of the ISA intact. Computer architects

had by this time learned the immense value of not breaking backward compatibility

with old software, and they weren’t about to go back to the bad old

days of marrying software to a single product. So the ISA stayed, but in a

stripped-down, much simplified form that enabled designers to implement

directly in hardware the same lightweight ISA over a variety of different

hardware types.

NOTE

Because the older, non-RISC ISAs featured richer, more complex instruction sets, they were labeled complex instruction set computing (CISC) ISAs in order to distinguish them from the new RISC ISAs. The x86 ISA is the most popular example of a CISC ISA, while PowerPC, MIPS, and Arm are all examples of popular RISC ISAs.

Moving Complexity from Hardware to Software

RISC machines were able to get rid of the microcode engine and still retain

the benefits of the ISA by moving complexity from hardware to software.

Where the microcode engine made CISC programming easier by providing

programmers with a rich variety of complex instructions, RISC programmers

depended on high-level languages, like C, and on compilers to ease the

burden of writing code for RISC ISAs’ restricted instruction sets.

Because a RISC ISA’s instruction set is more limited, it’s harder to write

long programs in assembly language for a RISC processor. (Imagine trying to

write a novel while restricting yourself to a fifth grade vocabulary, and you’ll get the idea.) A RISC assembly language programmer may have to use many

instructions to achieve the same result that a CISC assembly language programmer

can get with one or two instructions. The advent of high-level

languages (HLLs), like C, and the increasing sophistication of compiler

technology combined to effectively eliminate this programmer-unfriendly

aspect of RISC computing.

The ISA was and is still the optimal solution to the problem of easily and

consistently exposing hardware functionality to programmers so that software

can be used across a wide range of machines. The greatest testament

to the power and flexibility of the ISA is the longevity and ubiquity of the

world’s most popular and successful ISA: the x86 ISA. Programs written for the Intel 8086, a chip released in 1978, can run with relatively little modification on the latest Pentium 4. However, on a microarchitectural level, the

8086 and the Pentium 4 are as different as the Ford Model T and the Ford

Mustang Cobra.

Challenges to Pipelining and Superscalar Design

I noted previously that there are conditions under which two arithmetic

instructions cannot be “safely” dispatched in parallel for simultaneous execution

by the DLW-2’s two ALUs. Such conditions are called hazards, and

they can all be placed in one of three categories:

• Data hazards

• Structural hazards

• Control hazards

Because pipelining is a form of parallel execution, these three types of

hazards can also hinder pipelined execution, causing bubbles to occur in

the pipeline. In the following three sections, I’ll discuss each of these types

of hazards. I won’t go into a huge amount of detail about the tricks that

computer architects use to eliminate them or alleviate their effects, because

we’ll discuss those when we look at specific microprocessors in the next few

chapters.

Data Hazards

The best way to explain a data hazard is to illustrate one. Consider Program 4-1:

Line #   Code          Comments
1        add A, B, C   Add the numbers in registers A and B and store the result in C.
2        add C, D, D   Add the numbers in registers C and D and store the result in D.

Program 4-1: A data hazard

Because the second instruction in Program 4-1 depends on the outcome

of the first instruction, the two instructions cannot be executed

simultaneously. Rather, the add in line 1 must finish first, so that the result is available in C for the add in line 2.

Data hazards are a problem for both superscalar and pipelined execution.

If Program 4-1 is run on a superscalar processor with two integer ALUs, the

two add instructions cannot be executed simultaneously by the two ALUs.

Rather, the ALU executing the add in line 1 has to finish first, and then the

other ALU can execute the add in line 2. Similarly, if Program 4-1 is run on a

pipelined processor, the second add has to wait until the first add completes

the write stage before it can enter the execute stage. Thus the dispatch

circuitry has to recognize the add in line 2’s dependence on the add in line 1,

and keep the add in line 2 from entering the execute stage until the add in line 1’s result is available in register C.
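
The dependence check that the dispatch circuitry performs can be sketched in a few lines of Python. This is only an illustration, not the DLW-2's actual logic: the `Instr` type and the `parse` and `has_data_hazard` helpers are hypothetical names, and the sketch assumes the three-operand format used in Program 4-1, where the first two operands are sources and the third is the destination.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    """One three-operand instruction: two sources, then the destination."""
    op: str
    src1: str
    src2: str
    dest: str

def parse(text: str) -> Instr:
    """Parse a line such as 'add A, B, C' (hypothetical helper)."""
    op, operands = text.split(maxsplit=1)
    src1, src2, dest = (reg.strip() for reg in operands.split(","))
    return Instr(op, src1, src2, dest)

def has_data_hazard(first: Instr, second: Instr) -> bool:
    """True if `second` reads the register that `first` writes."""
    return first.dest in (second.src1, second.src2)

# Program 4-1: the add in line 2 reads C, which the add in line 1 writes.
line1, line2 = parse("add A, B, C"), parse("add C, D, D")
print(has_data_hazard(line1, line2))  # True
```

When the check returns True, the second instruction has to be held back until the first one's result is available.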

Most pipelined processors can do a trick called forwarding that’s aimed at alleviating the effects of this problem. With forwarding, the processor takes

the result of the first add from the ALU’s output port and feeds it directly

back into the ALU’s input port, bypassing the register-file write stage. Thus

the second add has to wait for the first add to finish only the execute stage, and not the execute and write stages, before it’s able to move into the execute

stage itself.
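
To put a number on the savings, here is a rough cycle count for Program 4-1. The model is an assumption made for illustration: a four-stage pipeline (fetch, decode, execute, write), one cycle per stage, with the second add issued one cycle behind the first and a result usable in the cycle after the stage that produces it. The `stall_cycles` function is a hypothetical name.

```python
# Assumed pipeline for illustration: four stages, one cycle each.
STAGES = ("fetch", "decode", "execute", "write")

def stall_cycles(forwarding: bool) -> int:
    """Bubbles inserted before the dependent add in line 2 can execute."""
    # The add in line 1 is fetched in cycle 1.
    first_execute = 1 + STAGES.index("execute")  # cycle 3
    first_write = 1 + STAGES.index("write")      # cycle 4
    # With no hazard, the second add would execute one cycle later.
    unstalled = first_execute + 1                # cycle 4
    # Forwarding hands over the result right after the execute stage;
    # without it, the second add waits for the write stage to complete.
    producing_stage = first_execute if forwarding else first_write
    return max(0, (producing_stage + 1) - unstalled)

print(stall_cycles(forwarding=False))  # 1
print(stall_cycles(forwarding=True))   # 0
```

Under these assumptions, forwarding removes the single bubble that the dependent add would otherwise cause.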

Register renaming is a trick that helps overcome data hazards on superscalar machines. Since any given machine’s programming model often specifies

fewer registers than can be implemented in hardware, a given microprocessor

implementation often has more registers than the number specified in the

programming model. To get an idea of how this group of additional registers

is used, take a look at Figure 4-8.

In Figure 4-8, the DLW-2’s programmer thinks that he or she is using a

single ALU with four architectural general-purpose registers—A, B, C, and D—

attached to it, because four registers and one ALU are all that the DLW

architecture’s programming model specifies. However, the actual superscalar

DLW-2 hardware has two ALUs and 16 microarchitectural GPRs implemented

in hardware. Thus the DLW-2’s register rename logic can map the four architectural

registers to the available microarchitectural registers in such a way as to prevent false register name conflicts.

In Figure 4-8, an instruction that’s being executed by IU1 might think

that it’s the only instruction executing and that it’s using registers A, B, and C, but it’s actually using rename registers 2, 5, and 10. Likewise, a second instruction executing simultaneously with the first instruction but in IU2 might also

think that it’s the only instruction executing and that it has a monopoly on

the register file, but in reality, it’s using registers 3, 7, 12, and 16. Once both IUs have finished executing their respective instructions, the DLW-2’s write-back logic takes care of transferring the contents of the rename registers back

to the four architectural registers in the proper order so that the program’s

state can be changed.

[Figure 4-8: Register renaming. One panel, Programming Model (Architecture), shows the ALU registers A, B, C, and D. The other panel, Hardware Implementation (Microarchitecture), shows a rename buffer with entries 1 through 16 and separate register sets A through D for IU1 and IU2.]

Let’s take a quick look at a false register name conflict in Program 4-2.

Line #   Code          Comments
1        add A, B, C   Add the numbers in registers A and B and store the result in C.
2        add D, B, A   Add the numbers in registers B and D and store the result in A.

Program 4-2: A false register name conflict

In Program 4-2, there is no data dependency, and both add instructions

can take place simultaneously except for one problem: the first add reads the

contents of A for its input, while the second add writes a new value into A as its output. Therefore, the first add’s read absolutely must take place before the second add’s write. Register renaming solves this register name conflict by

allowing the second add to write its output to a temporary register; after both

adds have executed in parallel, the result of the second add is written from

that temporary register into the architectural register A after the first add has finished executing and written back its own results.
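
The renaming step can be sketched as follows. This is a simplified model and not the DLW-2's actual rename logic: each instruction is reduced to a (source, source, destination) tuple, every destination is given a fresh temporary register, and the names `rename`, `r1`, `r2`, and so on are hypothetical. The write-back of temporaries to the architectural registers is left out.

```python
def rename(program, num_rename_regs=16):
    """Give every destination its own rename register.

    `program` is a list of (src1, src2, dest) tuples. Sources keep their
    architectural name unless an earlier instruction in the list has
    already been assigned a rename register for that name.
    """
    free = iter(f"r{n}" for n in range(1, num_rename_regs + 1))
    latest = {}   # architectural name -> rename register with newest value
    renamed = []
    for src1, src2, dest in program:
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        d = next(free)  # raises StopIteration if the pool runs out
        latest[dest] = d
        renamed.append((s1, s2, d))
    return renamed

# Program 4-2: the second add no longer writes a register the first reads.
print(rename([("A", "B", "C"), ("D", "B", "A")]))
# [('A', 'B', 'r1'), ('D', 'B', 'r2')]

# Program 4-1: the true dependence survives, now carried through r1.
print(rename([("A", "B", "C"), ("C", "D", "D")]))
# [('A', 'B', 'r1'), ('r1', 'D', 'r2')]
```

The two calls show the difference between the cases: renaming removes the false conflict in Program 4-2, while the real data hazard in Program 4-1 remains.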

Structural Hazards

Program 4-3 contains a short code example that shows superscalar execution

in action. Assuming the programming model presented for the DLW-2,

consider the following snippet of code.

Line #   Code          Comments
15       add A, B, B   Add the numbers in registers A and B and store the result in B.
16       add C, D, D   Add the numbers in registers C and D and store the result in D.

Program 4-3: A structural hazard

At first glance, there appears to be nothing wrong with Program 4-3.

There’s no data hazard, because the two instructions don’t depend on each

other. So it should be possible to execute them in parallel. However, this

example presumes that both ALUs share the same group of four registers.

But in order for the DLW-2’s register file to accommodate multiple ALUs

accessing it at once, it needs to be different from the DLW-1’s register file in one important way: it must be able to accommodate two simultaneous writes.

Otherwise, executing Program 4-3’s two instructions in parallel would trigger

what’s called a structural hazard, where the processor doesn’t have enough resources to execute both instructions at once.
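
A toy model makes the resource limit concrete. The `RegisterFile` class below is a hypothetical sketch, not a description of real hardware: it assumes the register file accepts at most `write_ports` writes per cycle and simply defers the rest. The operand values are made up for the example.

```python
class RegisterFile:
    """Four architectural registers with a limited number of write ports."""

    def __init__(self, write_ports: int):
        self.write_ports = write_ports
        self.values = {"A": 0, "B": 0, "C": 0, "D": 0}

    def write_cycle(self, writes):
        """Commit as many (register, value) writes as there are ports.

        Returns the writes that did not fit and must wait a cycle.
        """
        accepted = writes[:self.write_ports]
        deferred = writes[self.write_ports:]
        for reg, value in accepted:
            self.values[reg] = value
        return deferred

# Program 4-3: both adds finish in the same cycle and want to write B and D.
results = [("B", 3), ("D", 7)]  # illustrative values only

print(RegisterFile(write_ports=1).write_cycle(results))  # [('D', 7)]
print(RegisterFile(write_ports=2).write_cycle(results))  # []
```

With a single write port, the second result has to wait, which is the structural hazard; with two write ports, both writes go through in the same cycle.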

The Register File

In a superscalar design with multiple ALUs, it would take an enormous

number of wires to connect each register directly to each ALU. This problem

gets worse as the number of registers and ALUs increases. Hence, in superscalar

designs with a large number of registers, a CPU’s registers are grouped

together into a special unit called a register file. This unit is a memory array, much like the array of cells that makes up a computer’s main memory, and
