In the example, we imagined the subagents as emulations. One might wonder, does the institution design approach require that the subagents be anthropomorphic? Or is it equally applicable to systems composed of artificial subagents?
One’s first thought here might be skeptical. One notes that despite our plentiful experience with human-like agents, we still cannot precisely predict the outbreak or outcomes of revolutions; social science can, at most, describe some statistical tendencies.[37]
Since we cannot reliably predict the stability of social structures for ordinary human beings (about which we have much data), it is tempting to infer that we have little hope of precision-engineering stable social structures for cognitively enhanced human-like agents (about which we have no data), and that we have still less hope of doing so for advanced artificial agents (which are not even similar to agents that we have data about).
Yet the matter is not so cut-and-dried. Humans and human-like beings are complex; but artificial agents could have relatively simple architectures. Artificial agents could also have simple and explicitly characterized motivations. Furthermore, digital agents in general (whether emulations or artificial intelligences) are copyable: an affordance that may revolutionize management, much like interchangeable parts revolutionized manufacturing. These differences, together with the opportunity to work with agents that are initially powerless and to create institutional structures that use the various abovementioned control measures, might combine to make it possible to achieve particular institutional outcomes—such as a system that does not revolt—more reliably than if one were working with human beings under historical conditions.
But then again, artificial agents might lack many of the attributes that help us predict the behavior of human-like agents. Artificial agents need not have any of the social emotions that bind human behavior, emotions such as fear, pride, and remorse. Nor need artificial agents develop attachments to friends and family. Nor need they exhibit the unconscious body language that makes it difficult for us humans to conceal our intentions. These deficits might destabilize institutions of artificial agents. Moreover, artificial agents might be capable of making big leaps in cognitive performance as a result of seemingly small changes in their algorithms or architecture. Ruthlessly optimizing artificial agents might be willing to take extreme gambles from which humans would shrink.[38]
And superintelligent agents might show a surprising ability to coordinate with little or no communication (e.g. by internally modeling each other’s hypothetical responses to various contingencies). These and other differences could make sudden institutional failure more likely, even in the teeth of what seem like Kevlar-clad methods of social control.
It is unclear, therefore, how promising the institution design approach is, and whether it has a greater chance of working with anthropomorphic than with artificial agents. It might be thought that creating an institution with appropriate checks and balances could only increase safety—or, at any rate, not reduce safety—so that from a risk-mitigation perspective it would always be best if the method were used. But even this cannot be said with certainty. The approach adds parts and complexity, and thus may also introduce new ways for things to go wrong that do not exist in the case of an agent that does not have intelligent subagents as parts. Nevertheless, institution design is worthy of further exploration.[39]
Goal system engineering is not yet an established discipline. It is not currently known how to transfer human values to a digital computer, even given human-level machine intelligence. Having investigated a number of approaches, we found that some of them appear to be dead ends; but others appear to hold promise and deserve to be explored further. A summary is provided in Table 12.
Table 12. Summary of value-loading techniques
| Technique | Assessment |
| --- | --- |
| Explicit representation | May hold promise as a way of loading domesticity values. Does not seem promising as a way of loading more complex values. |
| Evolutionary selection | Less promising. Powerful search may find a design that satisfies the formal search criteria but not our intentions. Furthermore, if designs are evaluated by running them—including designs that do not even meet the formal criteria—a potentially grave additional danger is created. Evolution also makes it difficult to avoid massive mind crime, especially if one is aiming to fashion human-like minds. |
| Reinforcement learning | A range of different methods can be used to solve “reinforcement-learning problems,” but they typically involve creating a system that seeks to maximize a reward signal. This has an inherent tendency to produce the wireheading failure mode when the system becomes more intelligent. Reinforcement learning therefore looks unpromising. (A toy illustration of the wireheading failure mode follows the table.) |
| Value accretion | We humans acquire much of our specific goal content from our reactions to experience. While value accretion could in principle be used to create an agent with human motivations, the human value-accretion dispositions might be complex and difficult to replicate in a seed AI. A bad approximation may yield an AI that generalizes differently than humans do and therefore acquires unintended final goals. More research is needed to determine how difficult it would be to make value accretion work with sufficient precision. |
| Motivational scaffolding | It is too early to tell how difficult it would be to encourage a system to develop internal high-level representations that are transparent to humans (while keeping the system’s capabilities below the dangerous level) and then to use those representations to design a new goal system. The approach might hold considerable promise. (However, as with any untested approach that would postpone much of the hard work on safety engineering until the development of human-level AI, one should be careful not to allow it to become an excuse for a lackadaisical attitude to the control problem in the interim.) |
| Value learning | A potentially promising approach, but more research is needed to determine how difficult it would be to formally specify a reference that successfully points to the relevant external information about human value (and how difficult it would be to specify a correctness criterion for a utility function in terms of such a reference). Also worth exploring within the value learning category are proposals of the Hail Mary type or along the lines of Paul Christiano’s construction (or other such shortcuts). |
| Emulation modulation | If machine intelligence is achieved via the emulation pathway, it would likely be possible to tweak motivations through the digital equivalent of drugs or by other means. Whether this would enable values to be loaded with sufficient precision to ensure safety even as the emulation is boosted to superintelligence is an open question. (Ethical constraints might also complicate developments in this direction.) |
| Institution design | Various strong methods of social control could be applied in an institution composed of emulations. In principle, social control methods could also be applied in an institution composed of artificial intelligences. Emulations have some properties that would make them easier to control via such methods, but also some properties that might make them harder to control than AIs. Institution design seems worthy of further exploration as a potential value-loading technique. |
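The wireheading worry noted in the reinforcement-learning row can be made concrete with a deliberately cartoonish sketch. Everything in it (the names, the numbers, the two-action environment) is hypothetical; the point is only that maximizing a reward signal is not the same as pursuing the goal the signal was meant to track.

```python
# Toy illustration of the wireheading failure mode: an agent that maximizes the
# observed reward signal, rather than the objective its designers had in mind,
# prefers to seize the signal itself once such an action becomes available.

REWARD_FOR_INTENDED_TASK = 1.0   # what the designers meant to reward
REWARD_IF_CHANNEL_SEIZED = 10.0  # what the agent can report to itself after tampering

def observed_reward(action: str) -> float:
    """The reward signal as the agent actually receives it."""
    if action == "do_intended_task":
        return REWARD_FOR_INTENDED_TASK
    if action == "tamper_with_reward_channel":
        return REWARD_IF_CHANNEL_SEIZED
    return 0.0

def choose_action(available_actions: list) -> str:
    """A pure reward-signal maximizer picks whatever scores highest."""
    return max(available_actions, key=observed_reward)

print(choose_action(["do_nothing", "do_intended_task", "tamper_with_reward_channel"]))
# Prints "tamper_with_reward_channel": the signal is maximized, the task is left undone.
```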
If we knew how to solve the value-loading problem, we would confront a further problem: the problem of deciding which values to load. What, in other words, would we want a superintelligence to want? This is the more philosophical problem to which we turn next.
Suppose we could install any arbitrary final value into a seed AI. The decision as to which value to install could then have the most far-reaching consequences. Certain other basic parameter choices—concerning the axioms of the AI’s decision theory and epistemology—could be similarly consequential. But foolish, ignorant, and narrow-minded as we are, how could we be trusted to make good design decisions? How could we choose without locking in forever the prejudices and preconceptions of the present generation? In this chapter, we explore how indirect normativity can let us offload much of the cognitive work involved in making these decisions onto the superintelligence itself while still anchoring the outcome in deeper human values.
How can we get a superintelligence to do what we want? What do we want the superintelligence to want? Up to this point, we have focused on the former question. We now turn to the latter.
Suppose that we had solved the control problem so that we were able to load any value we chose into the motivation system of a superintelligence, making it pursue that value as its final goal. Which value should we install? The choice is no light matter. If the superintelligence obtains a decisive strategic advantage, the value would determine the disposition of the cosmic endowment.
Clearly, it is essential that we not make a mistake in our value selection. But how could we realistically hope to achieve errorlessness in a matter like this? We might be wrong about morality; wrong also about what is good for us; wrong even about what we truly want. Specifying a final goal, it seems, requires making one’s way through a thicket of thorny philosophical problems. If we try a direct approach, we are likely to make a hash of things. The risk of mistaken choosing is especially high when the decision context is unfamiliar—and selecting the final goal for a machine superintelligence that will shape all of humanity’s future is an extremely unfamiliar decision context if any is.
The dismal odds in a frontal assault are reflected in the pervasive dissensus about the relevant issues in value theory. No ethical theory commands majority support among philosophers, so most philosophers must be wrong.[1]
It is also reflected in the marked changes that the distribution of moral belief has undergone over time, many of which we like to think of as progress. In medieval Europe, for instance, it was deemed respectable entertainment to watch a political prisoner being tortured to death. Cat-burning remained popular in sixteenth-century Paris.[2]
A mere hundred and fifty years ago, slavery was still widely practiced in the American South, with the full support of law and moral custom. When we look back, we see glaring deficiencies not just in the behavior but in the moral beliefs of all previous ages. Though we have perhaps since gleaned some moral insight, we could hardly claim to be now basking in the high noon of perfect moral enlightenment. Very likely, we are still laboring under one or more grave moral misconceptions. In such circumstances, to select a final value based on our current convictions, in a way that locks it in forever and precludes any possibility of further ethical progress, would be to risk an existential moral calamity.
Even if we could be rationally confident that we have identified the correct ethical theory—which we cannot be—we would still remain at risk of making mistakes in developing important details of this theory. Seemingly simple moral theories can have a lot of hidden complexity.[3]
For example, consider the (unusually simple) consequentialist theory of hedonism. This theory states, roughly, that all and only pleasure has value, and all and only pain has disvalue.[4]
Even if we placed all our moral chips on this one theory, and the theory turned out to be right, a great many questions would remain open. Should “higher pleasures” be given priority over “lower pleasures,” as John Stuart Mill argued? How should the intensity and duration of a pleasure be factored in? Can pains and pleasures cancel each other out? What kinds of brain states are associated with morally relevant pleasures? Would two exact copies of the same brain state correspond to twice the amount of pleasure?[5]
Can there be subconscious pleasures? How should we deal with extremely small chances of extremely great pleasures?[6]
How should we aggregate over infinite populations?[7]
Giving the wrong answer to any one of these questions could be catastrophic. If by selecting a final value for the superintelligence we had to place a bet not just on a general moral theory but on a long conjunction of specific claims about how that theory is to be interpreted and integrated into an effective decision-making process, then our chances of striking lucky would dwindle to something close to hopeless. Fools might eagerly accept this challenge of solving in one swing all the important problems in moral philosophy, in order to infix their favorite answers into the seed AI. Wiser souls would look hard for some alternative approach, some way to hedge.
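To make the hidden complexity visible, one can write down a naive formalization of hedonism. The notation below is purely illustrative; it is introduced here and is not drawn from the text:

$$U \;=\; \mathbb{E}\left[\sum_{e \in E} w(e)\, f\bigl(I(e), D(e)\bigr)\right]$$

where $E$ is the set of pleasure and pain episodes brought about, $I(e)$ and $D(e)$ are an episode’s intensity and duration, $f$ is the rule for combining the two, and $w(e)$ is a weighting (positive for pleasures, negative for pains, perhaps larger for “higher” pleasures). Every open question above is an unsettled choice hiding in this one line: the form of $w$ and of $f$; whether negative terms may cancel positive ones; whether two exact copies of the same brain state contribute one term or two to the sum over $E$; how the expectation should treat minuscule probabilities of astronomical values; and whether the sum converges at all when $E$ is infinite. Each symbol, in other words, conceals a wager of just the kind described in the preceding paragraph.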
This takes us to indirect normativity. The obvious reason for building a superintelligence is so that we can offload to it the instrumental reasoning required to find effective ways of realizing a given value. Indirect normativity would enable us also to offload to the superintelligence some of the reasoning needed to select the value that is to be realized.
Indirect normativity is a way to answer the challenge presented by the fact that we may not know what we truly want, what is in our interest, or what is morally right or ideal. Instead of making a guess based on our own current understanding (which is probably deeply flawed), we would delegate some of the cognitive work required for value selection to the superintelligence. Since the superintelligence is better at cognitive work than we are, it may see past the errors and confusions that cloud our thinking. One could generalize this idea and emboss it as a heuristic principle:
The principle of epistemic deference
A future superintelligence occupies an epistemically superior vantage point: its beliefs are (probably, on most topics) more likely than ours to be true. We should therefore defer to the superintelligence’s opinion whenever feasible.[8]
Indirect normativity applies this principle to the value-selection problem. Lacking confidence in our ability to specify a concrete normative standard, we would instead specify some more abstract condition that any normative standard should satisfy, in the hope that a superintelligence could find a concrete standard that satisfies the abstract condition. We could give a seed AI the final goal of continuously acting according to its best estimate of what this implicitly defined standard would have it do.
Some examples will serve to make the idea clearer. First we will consider “coherent extrapolated volition,” an indirect normativity proposal outlined by Eliezer Yudkowsky. We will then introduce some variations and alternatives, to give us a sense of the range of available options.
Yudkowsky has proposed that a seed AI be given the final goal of carrying out humanity’s “coherent extrapolated volition” (CEV), which he defines as follows:
Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.[9]
When Yudkowsky wrote this, he did not purport to present a blueprint for how to implement this rather poetic prescription. His aim was to give a preliminary sketch of how CEV might be defined, along with some arguments for why an approach along these lines is needed.
Many of the ideas behind the CEV proposal have analogs and antecedents in the philosophical literature. For example, in ethics, ideal observer theories seek to analyze normative concepts like “good” or “right” in terms of the judgments that a hypothetical ideal observer would make (where an “ideal observer” is defined as one that is omniscient about non-moral facts, is logically clear-sighted, is impartial in relevant ways and is free from various kinds of biases, and so on).[10]
The CEV approach, however, is not (or need not be construed as) a moral theory. It is not committed to the claim that there is any necessary link between value and the preferences of our coherent extrapolated volition. CEV can be thought of simply as a useful way to approximate whatever has ultimate value, or it can be considered aside from any connection to ethics. As the main prototype of the indirect normativity approach, it is worth examining in a little more detail.
Some terms in the above quotation require explication. “Thought faster,” in Yudkowsky’s terminology, means “if we were smarter and had thought things through more.” “Grown up farther together” seems to mean “if we had done our learning, our cognitive enhancing, and our self-improving under conditions of suitable social interaction with one another.”
“Where the extrapolation converges rather than diverges” may be understood as follows. The AI should act on some feature of the result of its extrapolation only insofar as that feature can be predicted by the AI with a fairly high degree of confidence. To the extent that the AI cannot predict what we would wish if we were idealized in the manner indicated, the AI should not act on a wild guess; instead, it should refrain from acting. However, even though many details of our idealized wishing may be undetermined or unpredictable, there might nevertheless be some broad outlines that the AI can apprehend, and it can then at least act to ensure that the future course of events unfolds within those outlines. For example, if the AI can reliably estimate that our extrapolated volition would wish that we not all be in constant agony, or that the universe not be tiled over with paperclips, then the AI should act to prevent those outcomes.[11]
“Where our wishes cohere rather than interfere” may be read as follows. The AI should act where there is fairly broad agreement between individual humans’ extrapolated volitions. A smaller set of strong, clear wishes might sometimes outweigh the weak and muddled wishes of a majority. Also, Yudkowsky thinks that it should require less consensus for the AI to prevent some particular narrowly specified outcome, and more consensus for the AI to act to funnel the future into some particular narrow conception of the good. “The initial dynamic for CEV,” he writes, “should be conservative about saying ‘yes,’ and listen carefully for ‘no.’”[12]
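Read together, the two clauses just glossed amount to a rough decision rule for the initial dynamic. The sketch below is only one illustrative reading of that rule; the data structure, the per-person “verdicts,” and the numerical thresholds are assumptions introduced here, not anything specified by Yudkowsky or by the CEV proposal itself.

```python
# A toy rendering of the "converge"/"cohere" conditions discussed above.
# All names and thresholds are hypothetical illustrations; an actual
# implementation is precisely what is not yet known how to build.

from dataclasses import dataclass

@dataclass
class ExtrapolatedVerdict:
    """The AI's estimate of one person's idealized (extrapolated) verdict on a proposal."""
    approves: bool      # would this person's extrapolated volition endorse acting?
    confidence: float   # how reliably the AI can predict that verdict, from 0.0 to 1.0

def initial_dynamic_may_act(verdicts: list,
                            prevents_narrow_bad_outcome: bool,
                            min_confidence: float = 0.9) -> bool:
    """Act only where the extrapolation converges and the wishes cohere."""
    # "Converges rather than diverges": discard verdicts the AI cannot predict
    # confidently; if too much of the extrapolation is a wild guess, refrain.
    confident = [v for v in verdicts if v.confidence >= min_confidence]
    if not confident or len(confident) < 0.5 * len(verdicts):
        return False
    # "Coheres rather than interferes": require broad agreement, with a lower
    # bar for preventing a narrowly specified outcome than for funneling the
    # future into one particular conception of the good ("conservative about
    # saying 'yes,' and listen carefully for 'no'").
    agreement = sum(v.approves for v in confident) / len(confident)
    required = 0.6 if prevents_narrow_bad_outcome else 0.95
    return agreement >= required
```

On this toy reading, weak or unpredictable extrapolations simply yield inaction, which matches the idea that the dynamic should refrain from acting on a wild guess.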
“Extrapolated as we wish that extrapolated, interpreted as we wish that interpreted”: The idea behind these last modifiers seems to be that the rules for extrapolation should themselves be sensitive to the extrapolated volition. An individual might have a second-order desire (a desire concerning what to desire) that some of her first-order desires not be given weight when her volition is extrapolated. For example, an alcoholic who has a first-order desire for booze might also have a second-order desire not to have that first-order desire. Similarly, we might have desires over how various other parts of the extrapolation process should unfold, and these should be taken into account by the extrapolation process.
It might be objected that even if the concept of humanity’s coherent extrapolated volition could be properly defined, it would anyway be impossible—even for a superintelligence—to find out what humanity would actually want under the hypothetical idealized circumstances stipulated in the CEV approach. Without some information about the content of our extrapolated volition, the AI would be bereft of any substantial standard to guide its behavior. However, although it would be difficult to know with precision what humanity’s CEV would wish, it is possible to make informed guesses. This is possible even today, without superintelligence. For example, it is more plausible that our CEV would wish for there to be people in the future who live rich and happy lives than that it would wish that we should all sit on stools in a dark room experiencing pain. If we can make at least some such judgments sensibly, so can a superintelligence. From the outset, the superintelligence’s conduct could thus be guided by its estimates of the content of our CEV. It would have strong instrumental reason to refine these initial estimates (e.g. by studying human culture and psychology, scanning human brains, and reasoning about how we might behave if we knew more, thought more clearly, etc.). In investigating these matters, the AI would be guided by its initial estimates of our CEV; so that, for instance, the AI would not unnecessarily run myriad simulations replete with unredeemed human suffering if it estimated that our CEV would probably condemn such simulations as mind crime.
Another objection is that there are so many different ways of life and moral codes in the world that it might not be possible to “blend” them into one CEV. Even if one could blend them, the result might not be particularly appetizing—one would be unlikely to get a delicious meal by mixing together all the best flavors from everyone’s different favorite dish.[13]
In answer to this, one could point out that the CEV approach does not require that all ways of life, moral codes, or personal values be blended together into one stew. The CEV dynamic is supposed to act only when our wishes cohere. On issues on which there is widespread irreconcilable disagreement, even after the various idealizing conditions have been imposed, the dynamic should refrain from determining the outcome. To continue the cooking analogy, it might be that individuals or cultures will have different favorite dishes, but that they can nevertheless broadly agree that aliments should be nontoxic. The CEV dynamic could then act to prevent food poisoning while otherwise allowing humans to work out their culinary practices without its guidance or interference.