Are we all misaligned?
This text was originally posted on LessWrong.
The orthogonality thesis separates intelligence from goals, constraining the notion of intelligence to instrumental rationality and allowing for any combination of general intelligence and a goal system. Conversely, such a description should apply to any possible agent: for any agent, it should be possible to divide it losslessly into these two parts. Many arguments against the thesis have been advanced and shown incoherent, demonstrating its robustness and broad applicability. Still, there also exist valid critiques of its assumptions, such as the one by John Danaher. The following text discusses the human mind in the context of the orthogonality thesis and touches on the evolutionary origin of the human brain.
Let us start with an assertion: 'some intelligences and goal systems are irreducibly connected.' This protest appears intuitive, since it comes from the dissonance and confusion one experiences when attempting to apply the orthogonality thesis to oneself. Self-inquiry into the human mind produces wildly different ideas than observation of a Reinforcement Learning agent does. We do not perceive experienced pleasure as our final goal, nor does any alternative metric appear obvious. Satisfaction seems to rest on goals that themselves change. Core values we once considered ultimate are now irrelevant. When someone recommends you a book, saying 'this will change you as a person,' you do not worry about goal preservation.
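The clean separation that the orthogonality thesis posits is easy to exhibit in an RL-style agent. The following is a minimal, purely illustrative sketch (all names and numbers are my assumptions, not any standard benchmark): the same learning rule is paired with two different reward functions, and each pairing yields a competent agent pursuing a different final goal.

```python
import random

# Illustrative sketch: an agent split losslessly into a goal system
# (a reward function) and an intelligence (a learning rule). Under the
# orthogonality thesis, either part can be swapped independently.

def make_agent(reward_fn, actions, episodes=2000, eps=0.1):
    """Epsilon-greedy bandit learner; `reward_fn` plays the goal system."""
    values = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for _ in range(episodes):
        if random.random() < eps:
            a = random.choice(actions)          # explore
        else:
            a = max(values, key=values.get)     # exploit current estimates
        r = reward_fn(a)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental mean update
    return max(values, key=values.get)

random.seed(0)
actions = ["eat", "sleep", "explore"]
# Two different goal systems driving the very same learning rule:
hedonist = make_agent(lambda a: 1.0 if a == "eat" else 0.0, actions)
restful = make_agent(lambda a: 1.0 if a == "sleep" else 0.0, actions)
print(hedonist, restful)
```

Nothing about the learning rule had to change when the goal changed; for this kind of agent, the two subspaces really are disconnected, which is exactly what introspection on the human mind fails to reproduce.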
One counterexample would be a religious fanatic refusing to read the book, afraid that it will make him doubt his beliefs. There exist mechanisms in humans that give their opinions very high inertia, constructing defenses against acquiring a new perspective. Furthermore, by definition, if one decides to change one's core values in light of new information, those core values were never one's final goal. But what is the final goal, then? There is a stark contrast between that experience of confusion and the boldness of the orthogonality thesis.
Conversely, there is an intuitive feeling that the thesis does not constitute a complete picture of intelligent agency. One could point to holistic self-reference and, further, self-referential goal-setting as necessary qualities of general intelligence. Whereas the former can be explained away as a form of instrumental rationality, the latter would not only reject the orthogonality thesis but also show that the RL problem and the AGI problem are separate. The earlier assertion, that not every agent's description can be reduced to two disconnected subspaces of agent parameter space, is extended by: 'agents whose intelligence and goal system are disconnected are not really generally intelligent agents.' Not only is it impossible to fully describe humans by the orthogonality thesis, but the perceived self-determination at the core of this impossibility is a necessary condition for true General Intelligence.
To explore this contradiction, it has to be addressed in the context of the human mind's origins. The process that gave birth to the biological neural network constituting the central nervous system is evolution; indeed, all naturally occurring examples of intelligence result from the evolutionary heuristic applied to biological organisms. Whereas heuristics are usually implemented in a system artificially, evolution in the physical world can be thought of as a fundamental law of nature emergent from the laws of physics and the genetic nature of Earth's organisms. In general, the specimens better adapted to survival will survive and therefore be given a chance at procreation. The population will adapt toward better fitness, a metric combining survival and procreation abilities. The evolutionary entities will evolve thicker shells, stronger muscles, and better immune systems.
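The evolutionary heuristic described above can be sketched in a few lines. This is a toy model under loudly stated assumptions (the traits, the fitness formula, and all numbers are illustrative inventions, not biology): mutate every specimen, keep the better-adapted half, and fitness, combining survival and procreation, climbs.

```python
import random

# Toy sketch of the evolutionary heuristic: a genome is a pair
# (shell_thickness, muscle_strength); fitness multiplies survival
# ability by procreation ability. All quantities are invented.

def fitness(genome):
    shell, muscle = genome
    survival = shell + muscle              # thicker shell, stronger muscles
    procreation = max(0.0, 10 - shell)     # a heavy shell hinders reproduction
    return survival * procreation

def evolve(pop, generations=200, sigma=0.5):
    for _ in range(generations):
        # mutate every specimen, then keep the better-adapted half
        mutants = [(g[0] + random.gauss(0, sigma),
                    g[1] + random.gauss(0, sigma)) for g in pop]
        pop = sorted(pop + mutants, key=fitness, reverse=True)[: len(pop)]
    return pop

random.seed(1)
initial = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(20)]
final = evolve(initial)
print(round(fitness(final[0]), 1))  # far above the initial population's best
```

Note that the loop optimizes fitness directly; nothing in it yet has goals of its own, which is exactly the gap the next paragraphs open up.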
Another plane of improvement on which the same process occurs is information-processing and decision-making ability: from bare life functions to basic sensory capabilities, to the lizard brain, to the mammal brain, to tool-making and the emergence of social roles. The evolutionary entity develops the brain. That introduces agency to the organism. Whereas the emergent goal of the evolutionary entity is to improve its fitness, the brain it deploys can have wholly different goals. There appears a misalignment between the entity in the evolutionary process and the new intelligence: the human as a specimen of Homo sapiens against the human mind. This dichotomy can be illustrated by an area-specific application of the Darwinian Demon paradox.
The Darwinian Demon is a hypothetical organism whose fitness is maximized while disregarding the constraints arising from available variation and physiology. It is the global maximum of fitness in organism space. It reproduces instantly after being born, produces infinitely many offspring, and lives indefinitely; that is, as soon, as many, and as long as is possible within physical reality.
In this version of the thought experiment, the assumption is that physical characteristics remain frozen and only mental capabilities are subject to evolutionary change, in a world the organism cannot substantially transform. It is the evolution of intelligence ceteris paribus. The Darwinian Demonic Agency would be perfectly aligned to maximize the fitness of the species. To such a mind, it would be self-evident that the highest purpose of its existence is procreation and self-preservation, and it would be as intellectually well equipped for that task as possible. It would be an immortality-seeking sex superintelligence.
A trivial example of inherent misalignment between the Demon and the humans we are is suffering aversion as an end in itself, manifesting in epicureanism, negative utilitarianism, or other stances. This category was chosen for its universality: most organisms avoid suffering. From evolution's perspective, only the suffering that decreases expected species fitness matters (injuries, illnesses, rejection by a sexual partner). Humans also feel severe discontent and painful heartache from other origins. They feel homesick even if their homes offer no ability to either procreate or advance their longevity, and staying at home would decrease their species' expected fitness. Yet they might decide that returning home is preferable nevertheless, even after acknowledging its gene-wise suboptimality. And going beyond suffering aversion, the homesickness need not even be very straining; the person in question probably put no weight on species fitness at all.
Exploring the misalignment further: an extremely talented person, instead of deciding to donate sperm in an attempt to upgrade the species' gene pool, might feel depressed because of insufficient freedom in their life. Freedom is far from necessary for species fitness; one might suggest that a dystopian system enforcing eugenics would yield better results. Yet the talented person might abandon any romantic activities and consider suicide, which in this case is a total denial of the guiding principle of the process of evolution. An ideologically motivated mad scientist might even plot the extinction of humanity! The crucial fact is that these agents do not fail to understand that their actions do not contribute to the species' fitness. They might be well aware of it, yet they decide to act toward a goal directly contrary to that of the evolutionary process that gave birth to them. Evolution created misaligned agencies: human minds.
The Darwinian Demon's point in organism space is unobtainable by the evolutionary heuristic within the constraints of physical reality. Each mutation is a trade-off, and many areas of the space are cut off by insanely low fitness at their boundaries. While evolution seems to have much better coverage of the search space than gradient descent, as researchers of Open-Endedness have noticed, it will not overpower all physiological constraints. There might be a high evolutionary cost to alignment, or, under limited computational power, an aligned intelligence might be suboptimal. If humans were too dumb to realize that a ruthless pursuit of procreation destroys their species, it would be more optimal to have them believe in virtues that serve as the basis for a sustainable society: justice, fairness, industriousness.
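The trade-off argument can be made concrete with a tiny, hypothetical numerical example (the traits, the budget, and the fitness formula are invented for illustration only): the Demon sets every trait to its physical maximum at once, while a real organism searches only along a constraint coupling the traits, so even the best reachable point falls far short.

```python
# Hypothetical illustration of the trade-off argument: the Darwinian
# Demon maximizes fitness with no physiological constraints, while real
# evolution searches under a fixed "budget" coupling the traits.

def fitness(longevity, fecundity):
    return longevity * fecundity  # live long AND reproduce a lot

# The Demon ignores constraints: both traits at their physical maximum.
demon = fitness(10.0, 10.0)

# A real organism faces a trade-off: longevity + fecundity <= 10.
constrained = max(fitness(x / 100, 10.0 - x / 100) for x in range(0, 1001))

print(demon, constrained)  # the constrained optimum falls well short
```

The best point on the trade-off line (5.0, 5.0) scores 25, against the Demon's 100; no amount of searching along the constraint reaches the unconstrained corner.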
Perhaps the only sustainable society obtainable by evolution is one based on such virtues, and any intelligent species rejecting them goes extinct. Groups of humans that became ruthless, sex-crazed conquerors might have fared worse as a collective, since founding a community necessary for survival on those foundations would require thinking too far into the future. On the other hand, groups believing in justice, voluntarily engaging in culture, and feeling an impulse toward altruism, even when it is irrational and undesirable both for the specimen and for the species, would naturally form sustainable communities. A successful collective would then not depend on successful reasoning; rather, implanting such misalignments makes sustainable communities the mindless, default option.
Furthermore, if, through simulations or contact with alien species, it were shown that fairness and similar traits result from convergent evolution of all intelligent life, arising everywhere independently, this would give them an even more innate character. A strong sense of fairness could be cosmologically universal and vital for the development of all civilizations, even ones not carbon-based. This is, of course, pure speculation. It might just as well be revealed that the intuitive virtues of humanity are an absolute contingency and evolutionarily suboptimal.
Rather than originating as an obvious agent with a clearly defined goal system, the human mind evolved as a neural circuit to optimize the fitness of the organism it inhabits, with various constraints limiting both its abilities and its alignment. It might consist of neural sections with contradictory goals and not strive to maximize (or satisfice) any identifiable utility function. Instead, its actions, in general, should contribute to the fitness of the species, as fundamentally misaligned as they might sometimes be. Homicidal tendencies might arise from a dreadful feeling of social exclusion, yet that feeling is necessary for the species and overwhelmingly positive when all human agencies are aggregated. The complexity of such a system could serve as an example of a neural network with irreducibly interrelated intelligence and goal system. Another suggestion is that viewing the human goal system in the context of its misalignment with the evolutionary process might be very insightful.