It’s discouraging that
there’s an easily defined problem (“does this program
halt?”) that computers can’t reliably solve. However, that specific
problem is relatively abstract, and the
do_the_opposite.rb
program we used to
illustrate it is impractical and contrived; it doesn’t seem likely that
we’ll ever want to actually implement #halts?, or write a program like
do_the_opposite.rb
, as part of a real-world
application. Perhaps we can disregard undecidability as an academic
curiosity and get on with our lives.
Unfortunately it’s not that simple, because the halting problem is
not the only undecidable problem. There are plenty of others that we might
realistically want to solve in our day-to-day work of building software,
and their undecidability has real consequences for the practical
limitations of automated tools and processes.
Let’s look at a toy example. Suppose we’ve been given the job of developing a Ruby program
to print the string 'hello world'. That sounds simple
enough, but in our capacity as inveterate procrastinators
we’re also going to develop an automated tool that can reliably decide whether or
not a particular program prints hello world when supplied
with a particular input.
Armed with this tool, we can analyze our final program and check that it does what
it’s supposed to.
Now, imagine we succeed in developing a method #prints_hello_world?
that can correctly make
that decision about any program. Omitting implementation details, the
method has this general form:
def prints_hello_world?(program, input)
  # parse program
  # analyze program
  # return true if program prints "hello world", false if not
end
Once we’ve finished writing our original program, we can use #prints_hello_world?
to verify that it
does the right thing; if it does, we’ll check it into source control,
email the boss, and everyone’s happy. But the situation is even better
than that, because we can also use #prints_hello_world?
to implement another
interesting method:
def halts?(program, input)
  hello_world_program = %Q{
    program = #{program.inspect}
    input = $stdin.read
    evaluate(program, input) # evaluate program, ignoring its output
    print 'hello world'
  }
  prints_hello_world?(hello_world_program, input)
end
The %Q syntax quotes a string in the same way as %q and then
performs interpolation, so #{program.inspect} gets replaced with a Ruby
string literal containing the value of program.
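To see this interpolation in action, here’s a small demonstration. The looping program below is an arbitrary example chosen for illustration, not code from the surrounding text:

```ruby
# A demonstration of %Q interpolation with #inspect. The example
# program is an arbitrary infinite loop.
program = "x = 1\nwhile x == 1; end"

generated = %Q{
program = #{program.inspect}
input = $stdin.read
print 'hello world'
}

# #inspect renders the string as a Ruby literal, escaping the newline,
# so the generated source safely embeds the original program's code.
puts generated
```

Because #inspect produces a valid Ruby string literal, the generated program can be handed to an evaluator without any risk of the embedded source being misparsed.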
Our new version of #halts? works by constructing a special program,
hello_world_program, which does two main things:

1. Evaluates program with input available on its standard input
2. Prints hello world
hello_world_program is constructed so that its execution has only two
possible outcomes: either evaluate(program, input) will finish
successfully, in which case hello world will be printed, or
evaluate(program, input) will loop forever and there’ll be no output at all.
This special program is fed into #prints_hello_world?
to find out which of those two outcomes will happen. If #prints_hello_world?
returns true, that means evaluate(program, input)
will eventually finish and allow hello world
to be printed, so #halts? returns true
to indicate that program halts on input. If #prints_hello_world?
instead returns false, that must be because hello_world_program
will never reach its final line, so #halts? returns false
to say that evaluate(program, input) loops forever.
Our new implementation of #halts? shows that the halting problem is
reducible to the problem of checking whether a
program prints hello world. In other words, any algorithm
that computes #prints_hello_world? can be adapted to make
an algorithm that computes #halts?.
We already know that a working #halts? can’t exist, so
the obvious conclusion is that a complete implementation of #prints_hello_world?
can’t exist either. And if it’s impossible to implement, the
Church–Turing thesis says there’s no algorithm for it, so “does this program
print hello world?” is another undecidable problem.
In reality, nobody cares about automatically detecting whether a program prints a
particular string, but the structure of this undecidability proof points to something larger
and more general. We only had to construct a program that exhibits the
“prints hello world” property whenever some other program halts, and that
was enough to show undecidability. What stops us from reusing this argument for
any
property of program behavior, including properties that we actually
do care about?
Well, nothing does. This is
Rice’s theorem
:
any nontrivial property of program behavior is undecidable,
because the halting problem can always be reduced to the problem of
deciding whether that property is true; if we could invent an algorithm
for deciding that property, we’d be able to use it to build another
algorithm that decides the halting problem, and that’s impossible.
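The shape of that reduction can be sketched in code. The method names here (#decides_property? and the wrapping inside #halts?) are hypothetical, invented for illustration; no real decider can exist, so the stub simply raises:

```ruby
# Sketch of the reduction behind Rice's theorem, with hypothetical names.
# Suppose someone claimed a decider for a nontrivial behavioral property,
# say "does this program print the reverse of its input?":
def decides_property?(program, input)
  raise NotImplementedError, 'no such decider can exist'
end

# Given any such decider, we could decide halting: wrap the target
# program so the wrapped version exhibits the property exactly when
# the target halts.
def halts?(program, input)
  wrapped = %Q{
program = #{program.inspect}
input = $stdin.read
evaluate(program, input)  # loops forever if program doesn't halt
print input.reverse       # the property is exhibited only if we get here
}
  decides_property?(wrapped, input)
end
```

The wrapped program prints the reverse of its input if and only if the original program halts, so any answer from #decides_property? doubles as an answer to the halting problem.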
Roughly speaking, a “nontrivial property” is a claim about
what
a
program does, not
how
it does it. For example, Rice’s theorem doesn’t
apply to a purely syntactic property like “does this program’s source code contain the
string 'reverse'?”, because that’s an incidental
implementation detail that can be refactored away without changing the program’s externally
visible behavior. On the other hand, a semantic property like “does this program output the
reverse of its input?” is within the scope of Rice’s theorem and therefore
undecidable.
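The contrast is easy to make concrete: the syntactic property is decidable because we only have to look at the source text, never run it. A minimal sketch (the method name is made up for illustration):

```ruby
# A syntactic property is decidable: just examine the source text.
def mentions_reverse?(source)
  source.include?('reverse')
end

# Two programs with identical behavior (both print the reversed input)
# can differ on the syntactic property:
mentions_reverse?("print gets.reverse")
# => true
mentions_reverse?("print gets.chars.inject('') { |s, c| c + s }")
# => false
```

The second program reverses its input without ever containing the string 'reverse', which is exactly why a syntactic check tells us nothing reliable about behavior.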
Rice’s theorem tells us there are a huge number of undecidable
problems that are concerned with what a program will do when it’s
executed.
Undecidability is
an inconvenient fact of life. The halting problem is disappointing, because it
shows we can’t have everything: we want the unrestricted power of a universal programming
language, but we also want to write programs that produce a result without getting stuck in an
infinite loop, or at least programs whose
subroutines
halt as part of
some larger long-running task (see
Very Long-Running Computations
).
This disappointment is bleakly summarized in a classic
paper from 2004:
There is a dichotomy in language design, because of the halting
problem. For our programming discipline we are forced to choose
between

A) Security—a language in which all programs are known to
terminate.
B) Universality—a language in which we can write
(i) all terminating programs
(ii) silly programs which fail to terminate
and, given an arbitrary program we cannot in general say if it
is (i) or (ii).

Five decades ago, at the beginning of electronic computing, we
chose (B).

—David Turner, Total Functional Programming
Yes, we’d all like to avoid writing silly programs, but that’s just
tough luck. There’s no way to tell whether an arbitrary program is silly,
so we can’t completely avoid writing them without
sacrificing universality.
The implications of Rice’s theorem
are depressing too: not only is the question “does this program halt?”
undecidable, but so is “does this program do what I want it to do?” We live in a universe
where there’s no way to build a machine that can accurately predict whether a program will
print hello world, calculate a particular mathematical
function, or make a particular operating system call, and that’s just the way it is.
That’s frustrating, because it would be really useful to be able to check program
properties mechanically; the reliability of modern software would be improved by a tool that
decides whether a program conforms to its specification or contains any bugs. Those properties
might be mechanically checkable for individual programs, but unless they’re checkable in
general, we’ll never be able to trust machines to do the job for us.
For example, say we invent a new software platform and decide to
make money by selling compatible programs from an online shop—an
“application superstore,” if you like—on behalf of our platform’s
third-party developers. We want our customers to be able to shop with
confidence, so we decide to only sell programs that meet certain criteria:
they must not crash, they must not call private APIs, and they must not
execute arbitrary code downloaded from the Internet.
When thousands of developers start submitting programs to us, how do we review each one to
see whether it falls within our guidelines? It would save a lot of time and money if we could
use an automated system to check every submission for compliance, but thanks to
undecidability, it’s not possible to build a system that does the job accurately. We have no
choice but to hire a small army of human beings to manually test those programs by running
them, disassembling them, and instrumenting the operating system to profile their dynamic
behavior.
Manual review is slow, expensive, and error-prone, with the added
drawback that each program can only be run for a short amount of time,
providing a limited snapshot of its dynamic behavior. So even if nobody
makes a mistake, occasionally something undesirable will slip through the
net and we’ll have a load of angry customers on our hands. Thanks a lot,
undecidability.
Beneath all this inconvenience are two fundamental problems. The
first is that we don’t have the power to look into the future and see what
will happen when a program is executed; the only general way to find out
what a program does is to run it for real. While some programs are simple
enough to have behavior that’s straightforwardly
predictable, a universal language will always permit
programs whose behavior can’t be predicted just by analyzing their source
code.
The second problem is that, when we do decide to run a program,
there’s no reliable way to know how long it will take to finish. The only
general solution is to run it and wait, but since we know that programs in
a universal language can loop forever without halting, there will always
be some programs for which no finite amount of waiting is long
enough.
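In practice, then, the best we can do is run the program and impose an arbitrary deadline. A sketch using Ruby’s standard Timeout library (the method name appears_to_halt? is invented for illustration):

```ruby
require 'timeout'

# Run a block and give up after a deadline. A false answer is always
# inconclusive: the block might just have needed more time.
def appears_to_halt?(seconds, &block)
  Timeout.timeout(seconds) { block.call }
  true
rescue Timeout::Error
  false
end

appears_to_halt?(1) { 10_000.times { |n| n * n } }  # => true
appears_to_halt?(1) { loop {} }                     # => false (inconclusive)
```

No choice of deadline fixes the underlying problem: for any finite deadline there are programs that halt just after it, so a timeout can never distinguish “loops forever” from “merely slow.”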
In this chapter, we’ve seen that all universal systems are powerful enough to refer to
themselves. Programs operate on numbers, numbers can represent strings, and strings are how
the instructions of a program are written down, so programs are perfectly capable of operating
on their own source code.
This capacity for
self-reference makes it impossible to write a program that
can reliably predict program behavior. Once a particular behavior-checking
program has been written, we are always able to construct a larger program
that defeats it: the new program incorporates the checker as a subroutine,
checks its own source code, and then immediately does the exact opposite
of whatever the checker said it would do. These self-contradictory
programs are curiosities rather than something we’d ever write in
practice, but they’re just a symptom, not the cause, of the underlying
problem: in general, program behavior is too powerful to be accurately
predicted.
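The self-defeating construction can be sketched concretely. Here #halts? is a stub standing in for any proposed checker, and the symbol :loops_forever stands in for a real loop {} so the sketch itself terminates; all the names are illustrative, not from any real library:

```ruby
# Stub checker: any concrete implementation has to answer something.
# This one claims every program halts.
def halts?(program, input)
  true
end

# The defeating program: ask the checker about our own source, then do
# the opposite of whatever it predicted. (:loops_forever stands in for
# `loop {}` so this sketch can finish running.)
def do_the_opposite(own_source)
  if halts?(own_source, own_source)
    :loops_forever  # the checker said we halt, so loop
  else
    :halts          # the checker said we loop, so halt
  end
end

do_the_opposite("the source of this very program")  # => :loops_forever
```

Whichever answer the stub gives, the constructed program does the reverse, so the checker is guaranteed to be wrong about at least this one program.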
Human language has similar power and similar problems. “This
sentence is a lie” (the
liar paradox
) is an
English sentence that can’t be true or false. But the liar paradox
depends on the special self-referential word “this”; as we saw in
Programs Can Refer to Themselves
, any computer program can be made
self-referential
by construction
, without requiring
any special language features.
When it comes down to it, there are two basic reasons why the
behavior of programs is so hard to predict:
Any system with enough power to be
self-referential can’t correctly answer every question about itself.
We will always be able to construct a program like
do_the_opposite.rb
whose behavior can’t be predicted by the system. To avoid
this problem, we need to step outside the self-referential system and use a different,
more powerful
system to answer questions about it.
But in the case of universal programming languages, there
is
no
more powerful system for us to upgrade to. The Church–Turing thesis tells us that any
usable algorithm we invent for making predictions about the behavior of programs can
itself be performed by a program, so we’re stuck with the capabilities of universal
systems.