We cannot here delve into all the ethical and strategic issues associated with incentive wrapping. A project’s position on these issues, however, would be an important aspect of its fundamental design concept.
Another important design choice is which decision theory the AI should be built to use. This might affect how the AI behaves in certain strategically fateful situations. It might determine, for instance, whether the AI is open to trade with, or extortion by, other superintelligent civilizations whose existence it hypothesizes. The particulars of the decision theory could also matter in predicaments involving finite probabilities of infinite payoffs (“Pascalian wagers”) or extremely small probabilities of extremely large finite payoffs (“Pascalian muggings”) or in contexts where the AI is facing fundamental normative uncertainty or where there are multiple instantiations of the same agent program.[36] The options on the table include causal decision theory (in a variety of flavors) and evidential decision theory, along with newer candidates such as “timeless decision theory” and “updateless decision theory,” which are still under development.[37]
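To make the “Pascalian mugging” worry concrete, here is a toy calculation (all numbers invented for illustration) of how a naive expected-utility maximizer can be dominated by an extremely small probability of an extremely large finite payoff:

```python
# Toy "Pascalian mugging": a mugger promises an astronomically large reward
# in exchange for the agent's wallet. The agent's credence that the promise
# will be honored is tiny, but not tiny enough to offset the payoff.
p_promise_kept = 1e-20        # credence that the mugger delivers (made up)
promised_utility = 1e30       # utility of the promised reward (made up)
wallet_utility = 10.0         # utility of simply keeping the wallet (made up)

eu_hand_over = p_promise_kept * promised_utility   # = 1e10
eu_refuse = wallet_utility                         # = 10.0

print(eu_hand_over > eu_refuse)   # True: the naive maximizer hands over the wallet
```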
It may prove difficult to identify and articulate the correct decision theory, and to have justified confidence that we have got it right. Although the prospects for directly specifying an AI’s decision theory are perhaps more hopeful than those of directly specifying its final values, we are still confronted with a substantial risk of error. Many of the complications that might break the currently most popular decision theories were discovered only recently, suggesting that there might exist further problems that have not yet come into sight. The result of giving the AI a flawed decision theory might be disastrous, possibly amounting to an existential catastrophe.
In view of these difficulties, one might consider an indirect approach to specifying the decision theory that the AI should use. Exactly how to do this is not yet clear. We might want the AI to use “that decision theory D which we would have wanted it to use had we thought long and hard about the matter.” However, the AI would need to be able to make decisions before learning what D is. It would thus need some effective interim decision theory D′ that would govern its search for D. One might try to define D′ to be some sort of superposition of the AI’s current hypotheses about D (weighted by their probabilities), though there are unsolved technical problems with how to do this in a fully general way.[38]
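As a rough illustration of what such a superposition might look like (the function names are hypothetical, and reducing each candidate theory to a numerical score is an assumption made purely for illustration; it glosses over the unsolved problems just mentioned, such as whether the candidates’ recommendations are even commensurable):

```python
# Sketch of an interim decision theory D': score each action by a
# credence-weighted average of the scores assigned by the candidate
# decision theories the AI currently entertains for D.
from typing import Callable, Dict, List

CandidateTheory = Callable[[str], float]   # action -> that theory's score

def interim_choice(actions: List[str],
                   candidates: Dict[str, CandidateTheory],
                   credence: Dict[str, float]) -> str:
    """Pick the action with the highest credence-weighted score."""
    def weighted_score(action: str) -> float:
        return sum(credence[name] * theory(action)
                   for name, theory in candidates.items())
    return max(actions, key=weighted_score)

# Usage with two made-up candidate theories that disagree about an action:
theory_a = lambda a: {"act_1": 5.0, "act_2": 8.0}[a]
theory_b = lambda a: {"act_1": 9.0, "act_2": 2.0}[a]
print(interim_choice(["act_1", "act_2"],
                     {"A": theory_a, "B": theory_b},
                     {"A": 0.6, "B": 0.4}))   # "act_1" (6.6 vs 5.6)
```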
There is also cause for concern that the AI might make irreversibly bad decisions (such as rewriting itself to henceforth run on some flawed decision theory) during the learning phase, before the AI has had the opportunity to determine which particular decision theory is correct. To reduce the risk of derailment during this period of vulnerability we might instead try to endow the seed AI with some form of restricted rationality: a deliberately simplified but hopefully dependable decision theory that staunchly ignores esoteric considerations, even ones we think may ultimately be legitimate, and that is designed to replace itself with a more sophisticated (indirectly specified) decision theory once certain conditions are met.[39] It is an open research question whether and how this could be made to work.
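As a very loose sketch of the shape such a scheme could take (every name and threshold here is a hypothetical stand-in, not a proposal from the text): a deliberately simple expected-utility core that ignores hypotheses below a probability floor, wrapped so that control passes to an indirectly specified successor theory only once some independently verified condition is met.

```python
# Hypothetical "restricted rationality" wrapper: a simplified decision rule
# that ignores esoteric (tiny-probability) considerations, plus a handover
# condition for replacing itself with a more sophisticated theory.
PROBABILITY_FLOOR = 1e-6   # hypotheses below this credence are ignored

def restricted_choice(actions, outcomes):
    """outcomes maps each action to a list of (probability, utility) pairs."""
    def value(action):
        return sum(p * u for p, u in outcomes[action] if p >= PROBABILITY_FLOOR)
    return max(actions, key=value)

def decide(actions, outcomes, handover_condition_met, successor_theory):
    if handover_condition_met():
        return successor_theory(actions, outcomes)    # switch to refined theory
    return restricted_choice(actions, outcomes)       # interim, simplified regime
```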
A project will also need to make a fundamental design choice in selecting the AI’s epistemology, specifying the principles and criteria whereby empirical hypotheses are to be evaluated. Within a Bayesian framework, we can think of the epistemology as a prior probability function—the AI’s implicit assignment of probabilities to possible worlds before it has taken any perceptual evidence into account. In other frameworks, the epistemology might take a different form; but in any case some inductive learning rule is necessary if the AI is to generalize from past observations and make predictions about the future.[40] As with the goal content and the decision theory, however, there is a risk that our epistemology specification could miss the mark.
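A minimal sketch of the Bayesian picture just described, with the “possible worlds” and likelihoods invented purely for illustration: the prior is a probability assignment over worlds, the inductive learning rule is Bayes’ rule, and predictions are posterior-weighted averages.

```python
# Epistemology as a prior over possible worlds plus an inductive rule.
worlds = {"w_fair": 0.5, "w_biased": 0.5}          # prior P(w), before evidence
likelihood = {                                      # P(observation | w)
    "w_fair":   {"heads": 0.5, "tails": 0.5},
    "w_biased": {"heads": 0.9, "tails": 0.1},
}

def update(prior, observation):
    """Bayes' rule: posterior(w) is proportional to prior(w) * P(obs | w)."""
    unnorm = {w: p * likelihood[w][observation] for w, p in prior.items()}
    total = sum(unnorm.values())
    return {w: v / total for w, v in unnorm.items()}

def predict(posterior, outcome):
    """P(next outcome) = sum over worlds of posterior(w) * P(outcome | w)."""
    return sum(posterior[w] * likelihood[w][outcome] for w in posterior)

beliefs = worlds
for obs in ["heads", "heads", "heads"]:
    beliefs = update(beliefs, obs)
print(beliefs, predict(beliefs, "heads"))
```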
One might think that there is a limit to how much damage could arise from an incorrectly specified epistemology. If the epistemology is too dysfunctional, then the AI could not be very intelligent and it could not pose the kind of risk discussed in this book. But the concern is that we may specify an epistemology that is sufficiently sound to make the AI instrumentally effective in most situations, yet which has some flaw that leads the AI astray on some matter of crucial importance. Such an AI might be akin to a quick-witted person whose worldview is predicated on a false dogma, held to with absolute conviction, who consequently “tilts at windmills” and gives his all in pursuit of fantastical or harmful objectives.
Certain kinds of subtle difference in an AI’s prior could turn out to make a drastic difference to how it behaves. For example, an AI might be given a prior that assigns zero probability to the universe being infinite. No matter how much astronomical evidence it accrues to the contrary, such an AI would stubbornly reject any cosmological theory that implied an infinite universe; and it might make foolish choices as a result.[41]
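The stubbornness follows directly from Bayes’ theorem. Writing H for the hypothesis that the universe is infinite and E for any body of evidence with positive probability:

\[
P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)} \;=\; \frac{P(E \mid H)\cdot 0}{P(E)} \;=\; 0
\]

No possible evidence can raise the posterior above zero once the prior has been set to zero.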
Or an AI might be given a prior that assigns a zero probability to the universe not being Turing-computable (this is in fact a common feature of many of the priors discussed in the literature, including the Kolmogorov complexity prior mentioned in Chapter 1), again with poorly understood consequences if the embedded assumption—known as the “Church–Turing thesis”—should turn out to be false. An AI could also end up with a prior that makes strong metaphysical commitments of one sort or another, for instance by ruling out a priori the possibility that any strong form of mind–body dualism could be true or the possibility that there are irreducible moral facts. If any of those commitments is mistaken, the AI might seek to realize its final goals in ways that we would regard as perverse instantiations. Yet there is no obvious reason why such an AI, despite being fundamentally wrong about one important matter, could not be sufficiently instrumentally effective to secure a decisive strategic advantage. (Anthropics, the study of how to make inferences from indexical information in the presence of observation selection effects, is another area where the choice of epistemic axioms could prove pivotal.[42])
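For concreteness, a toy sketch of a Kolmogorov-complexity-style prior of the kind alluded to (the hypotheses and description lengths below are placeholders; a genuine Solomonoff-style prior is uncomputable and far subtler than this):

```python
# Complexity-weighted prior: each hypothesis gets weight 2**(-description
# length in bits), normalized. Hypotheses with no finite description at all
# (non-computable worlds) implicitly get weight zero, which is the
# structural commitment discussed in the text.
description_length_bits = {       # made-up hypotheses and lengths
    "computable_model_A": 40,
    "computable_model_B": 55,
}

def complexity_prior(lengths):
    unnorm = {h: 2.0 ** -k for h, k in lengths.items()}
    total = sum(unnorm.values())
    return {h: v / total for h, v in unnorm.items()}

print(complexity_prior(description_length_bits))
# model_A gets ~0.99997 of the prior mass, model_B ~3.05e-05
```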
We might reasonably doubt our ability to resolve all foundational issues in epistemology in time for the construction of the first seed AI. We may, therefore, consider taking an indirect approach to specifying the AI’s epistemology. This would raise many of the same issues as taking an indirect approach to specifying its decision theory. In the case of epistemology, however, there may be greater hope of benign convergence, with any of a wide class of epistemologies providing an adequate foundation for safe and effective AI and ultimately yielding similar doxastic results. The reason for this is that sufficiently abundant empirical evidence and analysis would tend to wash out any moderate differences in prior expectations.[43]
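A small numerical illustration of the washing-out claim (the priors and data are invented): two agents with quite different Beta priors over a coin’s bias end up with nearly identical posterior estimates after conditioning on the same large body of evidence.

```python
# Beta prior + binomial data -> Beta posterior; the posterior mean is
# (alpha + heads) / (alpha + beta + heads + tails).
def posterior_mean(alpha, beta, heads, tails):
    return (alpha + heads) / (alpha + beta + heads + tails)

heads, tails = 7_012, 2_988                      # shared evidence: 10,000 tosses
print(posterior_mean(1, 1, heads, tails))        # uniform prior    -> ~0.7012
print(posterior_mean(20, 80, heads, tails))      # skeptical prior  -> ~0.6962
```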
A good aim would be to endow the AI with fundamental epistemological principles that match those governing our own thinking. Any AI diverging from this ideal is an AI that we would judge to be reasoning incorrectly if we consistently applied our own standards. Of course, this applies only to our fundamental epistemological principles. Non-fundamental principles should be continuously created and revised by the seed AI itself as it develops its understanding of the world. The point of superintelligence is not to pander to human preconceptions but to make mincemeat out of our ignorance and folly.
The final item in our list of design choices is ratification. Should the AI’s plans be subjected to human review before being put into effect? For an oracle, this question is implicitly answered in the affirmative. The oracle outputs information; human reviewers choose whether and how to act upon it. For genies, sovereigns, and tool-AIs, however, the question of whether to use some form of ratification remains open.
To illustrate how ratification might work, consider an AI intended to function as a sovereign implementing humanity’s CEV. Instead of launching this AI directly, imagine that we first built an oracle AI for the sole purpose of answering questions about what the sovereign AI would do. As earlier chapters revealed, there are risks in creating a superintelligent oracle (such as risks of mind crime or infrastructure profusion). But for purposes of this example let us assume that the oracle AI has been successfully implemented in a way that avoided these pitfalls.
We thus have an oracle AI that offers us its best guesses about the consequences of running some piece of code intended to implement humanity’s CEV. The oracle may not be able to predict in detail what would happen, but its predictions are likely to be better than our own. (If it were impossible even for a superintelligence to predict anything about what the code would do, we would be crazy to run it.) So the oracle ponders for a while and then presents its forecast. To make the answer intelligible, the oracle may offer the operator a range of tools with which to explore various features of the predicted outcome. The oracle could show pictures of what the future looks like and provide statistics about the number of sentient beings that will exist at different times, along with average, peak, and lowest levels of well-being. It could offer intimate biographies of several randomly selected individuals (perhaps imaginary people selected to be probably representative). It could highlight aspects of the future that the operator might not have thought of inquiring about but which would be regarded as pertinent once pointed out.
Being able to preview the outcome in this manner has obvious advantages. The preview could reveal the consequences of an error in a planned sovereign’s design specifications or source code. If the crystal ball shows a ruined future, we could scrap the code for the planned sovereign AI and try something else. A strong case could be made that we should familiarize ourselves with the concrete ramifications of an option before committing to it, especially when the entire future of the race is on the line.
What is perhaps less obvious is that ratification also has potentially significant disadvantages. The irenic quality of CEV might be undermined if opposing factions, instead of submitting to the arbitration of superior wisdom in confident expectation of being vindicated, could see in advance what the verdict would be. A proponent of the morality-based approach might worry that the sponsor’s resolve would collapse if all the sacrifices required by the morally optimal were to be revealed. And we might all have reason to prefer a future that holds some surprises, some dissonance, some wildness, some opportunities for self-overcoming—a future whose contours are not too snugly tailored to present preconceptions but provide some give for dramatic movement and unplanned growth. We might be less likely to take such an expansive view if we could cherry-pick every detail of the future, sending back to the drawing board any draft that does not fully conform to our fancy at that moment.
The issue of sponsor ratification is therefore less clear-cut than it might initially seem. Nevertheless, on balance it would seem prudent to take advantage of an opportunity to preview, if that functionality is available. But rather than letting the reviewer fine-tune every aspect of the outcome, we might give her a simple veto which could be exercised only a few times before the entire project would be aborted.[44]
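One way to picture the scheme, as a bare sketch (the function names are hypothetical stand-ins for the oracle preview, the reviewer, and the code-generation step):

```python
# Ratification with a limited veto: the reviewer may reject a previewed
# outcome only a few times; after that the entire project is aborted rather
# than fine-tuned indefinitely.
MAX_VETOES = 3

def ratify(generate_candidate, preview, reviewer_approves):
    """Return a ratified candidate, or None if the project is aborted."""
    vetoes = 0
    while vetoes < MAX_VETOES:
        candidate = generate_candidate(vetoes)    # next draft of the sovereign
        report = preview(candidate)               # oracle's forecast of outcomes
        if reviewer_approves(report):
            return candidate                      # ratified: may be launched
        vetoes += 1                               # veto exercised
    return None                                   # vetoes exhausted: abort project
```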
The main purpose of ratification would be to reduce the probability of catastrophic error. In general, it seems wise to aim at minimizing the risk of catastrophic error rather than at maximizing the chance of every detail being fully optimized. There are two reasons for this. First, humanity’s cosmic endowment is astronomically large—there is plenty to go around even if our process involves some waste or accepts some unnecessary constraints. Second, there is a hope that if we but get the initial conditions for the intelligence explosion approximately right, then the resulting superintelligence may eventually home in on, and precisely hit, our ultimate objectives. The important thing is to land in the right attractor basin.
With regard to epistemology, it is plausible that a wide range of priors will ultimately converge to very similar posteriors (when computed by a superintelligence and conditionalized on a realistic amount of data). We therefore need not worry about getting the epistemology exactly right. We need only avoid giving the AI a prior that is so extreme as to render the AI incapable of learning vital truths even with the benefit of copious experience and analysis.[45]
With regard to decision theory, the risk of irrecoverable error seems larger. We might still hope to directly specify a decision theory that is good enough. A superintelligent AI could switch to a new decision theory at any time; however, if it starts out with a sufficiently wrong decision theory it may not see the reason to switch. Even if an agent comes to see the benefits of having a different decision theory, the realization might come too late. For example, an agent designed to refuse blackmail might enjoy the benefit of deterring would-be extortionists. For this reason, blackmailable agents might do well to proactively adopt a non-exploitable decision theory. Yet once a blackmailable agent receives the threat and regards it as credible, the damage is done.
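The timing point can be seen in a toy model (all payoffs invented): an agent known in advance to refuse blackmail never gets threatened, while an agent that only adopts the refusal policy after receiving a credible threat has already forfeited the deterrence benefit.

```python
# Toy blackmail game from the agent's perspective.
PAY_DEMAND = 10      # utility lost by paying the extortionist (made up)
THREAT_HARM = 100    # utility lost if the threat is carried out (made up)

def extortionist_threatens(agent_expected_to_pay: bool) -> bool:
    # The extortionist only bothers threatening an agent it expects to pay.
    return agent_expected_to_pay

def agent_payoff(committed_to_refuse_in_advance: bool) -> int:
    expected_to_pay = not committed_to_refuse_in_advance
    if not extortionist_threatens(expected_to_pay):
        return 0                                  # deterred: no threat arrives
    # A credible threat has arrived; the remaining options are all bad:
    # pay the demand, or refuse now and absorb the harm.
    return -min(PAY_DEMAND, THREAT_HARM)

print(agent_payoff(True))    # 0: non-exploitable from the start
print(agent_payoff(False))   # -10: adopting refusal late no longer deters
```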
Given an adequate epistemology and decision theory, we could try to design the system to implement CEV or some other indirectly specified goal content. Again there is hope of convergence: that different ways of implementing a CEV-like dynamic would lead to the same utopian outcome. Short of such convergence, we may still hope that many of the different possible outcomes are good enough to count as existential success.