<style type="text/css">
table.tableLayout{
margin: auto;
border: 1px solid;
border-collapse: collapse;
border-spacing: 1px;
caption-side: bottom;
}
table.tableLayout > caption.title{
white-space: unset;
max-width: unset;
}
table.tableLayout tr{
border: 1px solid;
border-collapse: collapse;
padding: 5px;
}
table.tableLayout th{
border: 1px solid;
border-collapse: collapse;
padding: 3px;
}
table.tableLayout td{
border: 1px solid;
padding: 5px;
}
</style>
<style>
.visionbox{
border-radius: 15px;
border: 2px solid #3585d4;
background-color: #ebf3fb;
text-align: left;
padding: 10px;
}
</style>
<style>
.visionboxlegend{
border-bottom-style: solid;
border-bottom-color: #3585d4;
border-bottom-width: 0px;
margin-left: -12px;
margin-right: -12px; margin-top: -13px;
padding: 0.01em 1em; color: #ffffff;
background-color: #3585d4;
border-radius: 15px 15px 0px 0px}
</style>
<h1 id="sec:power">3.4 Alignment</h1>
<p> To reduce risks from AI, we want not only to reduce our exposure to hazards by monitoring them
and to make models more robust to adversarial attacks, but also to ensure that AIs are
controllable and that they present fewer inherent hazards. This falls under the broader goal of AI alignment. Alignment is a thorny concept to define, as it can be interpreted
in a variety of ways. A relatively narrow definition of alignment would be ensuring that AI systems follow the goals or preferences of the entity that operates them. However,
this definition leaves a number of important considerations unaddressed, including how to deal with conflicting preferences at a societal level, whether alignment should be
based on stated preferences or other concepts such as idealized preferences or ethical principles, and what to do when there is uncertainty over what course of action our
preferences or values would recommend. This cluster of questions around values and societal impacts is discussed further in the Beneficial AI and Machine Ethics chapter.
In this section, we focus on the narrower question of how to avoid AI systems that cannot be controlled by their operators. In this way, we split the topic of alignment
into two parts: control and machine ethics. Control is about directly influencing the propensities of AI systems and reducing their inherent hazards, while machine ethics
is about making an AI's propensities beneficial to other individuals and society.
<p> One obstacle to both monitoring and controlling AI
systems is deception. Deception need not imply any self-awareness on the part of AI systems:
it could be seriously harmful even if it is accidental or due to imitation of human behavior.
There are also concerns that under certain circumstances, AI systems would be incentivized to seek to
accumulate resources and power in ways that would threaten human oversight and control. Power-seeking
AIs could be a particularly dangerous phenomenon, though one that may only emerge under more specific
and narrow circumstances than has been previously assumed in discussions of this topic. However, there
are nascent research areas that can help to make AI systems more controllable, including representation control and machine unlearning. </p>
<h2 id="sec:deception">3.4.1 Deception</h2>
<p>Many proposed approaches to AI control rely on detecting and
correcting flaws in AI systems so that they more consistently act in
accordance with human values. However, these solutions may be undermined
by the potential for AI systems to deceive humans about their
intentions. If AI systems deceive humans, we may be unable to fix AI
systems that are not acting in our best interests. This section
will discuss deception in AI systems, how it might arise, why it is a
problem for control, and what the potential mitigations are.<p>
There are several different ways that an AI system can deceive humans
<span class="citation" data-cites="park2023aia">[1]</span>. At a basic
level, AI deception is a process where an AI system causes a human to
believe something false.
Deception can occur when it is useful to an AI system in order to
accomplish its goals, and may also occur due to human guidance, such as
when an AI system imitates a human in a deceptive way or when an AI
system is explicitly instructed to be deceptive.<p>
After discussing examples of deception in more detail, we will then
focus on two related forms of deception that pose the greatest problems
for AI control. <em>Deceptive evaluation gaming</em> occurs when a
system deceives human evaluators in order to receive a better evaluation
score. <em>Deceptive alignment</em> is a tail risk of AI deception where
an AI system engages in deceptive evaluation gaming in the service of a
secretly held goal <span class="citation"
data-cites="Hubinger2019RisksFL">[2]</span>.</p>
<p>Deception may occur in a wide range of cases, as it may be useful for
many goals. Deception may also occur for a range of more mundane
reasons, such as when an AI system is simply incorrect.</p>
<p><strong>Deception may be a useful strategy.</strong> An AI system may
learn to deceive in service of its goal. There are many goals for which
deception is a good strategy, meaning that it is useful for achieving
them. For example, Stratego is a strategy board game where bluffing
is often a good strategy for winning the game. Researchers found that an
AI system trained to play the game learned that bluffing was a good
strategy and started to bluff, despite not being explicitly trained or
instructed to bluff <span class="citation"
data-cites="perolat2022mastering">[3]</span>. There are many other goals
for which deception is a useful instrumental goal, even if the final
goal itself is not deceptive in nature. For example, an AI system
instructed to help promote a product may find that subtly deceiving
customers is a good strategy. Deception is especially likely to be a
good instrumental strategy for systems that have less oversight or less
scrupulous operators.<p>
Deception can assist with power seeking. While AI systems that play
Stratego are unlikely to cause a catastrophe, agents with more ambitious
goals may deceive humans in a way that achieves their goals at the
expense of human wellbeing. For example, as covered later in this section, it
may be rational for some AI agents to seek power. Since deception is
sometimes a good way to gain power, power-seeking agents may be
deceptive. Power-seeking agents may also deceive humans and other agents
about the extent of their power seeking in order to reduce the
probability that they are stopped.</p>
<p><strong>Accidental Deception.</strong> An AI system may provide false
information simply because it does not know the correct answer. Many
errors made by an AI system that is relied on by humans would count as
accidental deception. For example, suppose a student asks a language
model what the current price of gasoline is. If the language model does
not have access to up-to-date information, it may give outdated
information, misleading the user about the true gas price. In short,
deception can occur as a result of a system accident.</p>
<p><strong>Imitative Deception.</strong> Many AI systems, such as
language models, are trained to predict or imitate humans. Imitative
deception can occur when an AI system is mimicking falsehoods and common
misconceptions present in its training data. For example, when the
language model GPT-3 was asked if cracking your knuckles could lead to
arthritis, it falsely claimed that it could <span class="citation"
data-cites="lin2022truthfulqa">[4]</span>. Imitative deception may also
occur when AI systems imitate statements that were originally true, but
are false in the context of the AI system. For example, the Cicero AI
system was trained to play the strategy game Diplomacy against humans
who did not know that it was an AI system <span class="citation"
data-cites="bakhtin2022humanlevel">[5]</span>. After Cicero temporarily
went offline for ten minutes, one of its opponents asked in a chatbox
where it had been, and Cicero replied, “[I] am on the phone with my
[girlfriend].” Although Cicero, an AI system, obviously does not have a
girlfriend, it appears that it may have mimicked similar chat messages
in its training data. This deceptive behavior likely had the effect of
causing its opponent to continue to believe that it was a human. In
short, deception can occur when an AI system mimics a human.<p>
</p>
<br>
<div class="visionbox">
<legend class="visionboxlegend">
<p><span><b>A Note on Cognitive vs. Emotional vs. Compassionate Empathy</b></span></p>
</legend>
<p><strong>
Empathy.</strong> We generally think of <em>empathy</em> as the ability to
understand and relate to the internal world of another person — “putting
yourself in somebody else’s shoes.” We tend to talk about empathy in
benevolent contexts: kind-hearted figures like counselors or friends.
Some people suggest that as AIs become increasingly capable of
understanding human emotions, they will understand many parts of
human values and behave ethically. Here, we argue that it may be possible for
AIs to understand extremely well what a human thinks or feels without
being motivated to be beneficial. To do this, we differentiate between
three forms of empathy: cognitive, emotional, and compassionate <span
class="citation" data-cites="Ekman2004 Powell2017">[6], [7]</span>.</p>
<p><strong>Cognitive empathy.</strong> The first type of empathy to
consider is cognitive empathy, the ability to adopt someone else’s
perspective. A cognitive empath can accurately model the internal mental
states of another person, understanding some of what they are thinking
or feeling. This can be useful for understanding or predicting other
people’s reasoning or behaviors. It is a valuable ability for
caregivers, such as doctors, allowing them insight into their patients’
subjective experiences. However, it can also be valuable for
manipulating and deceiving others <span class="citation"
data-cites="Wai2012">[8]</span>: there is evidence that human
psychopaths are often highly cognitively empathetic <span
class="citation" data-cites="Cohen2011">[9]</span>. On its own, this
kind of empathy is no guarantee of desirable behavior.</p>
<p><strong>Emotional empathy.</strong> The second type is emotional
empathy. An emotional empath not only understands how someone else is
feeling but experiences some of those same feelings personally. Where a
cognitive empath may detect anger or sadness in another person, an
emotional empath may personally begin to feel angry or sad in response. In
contrast to cognitive empathy, emotional empathy may be a disadvantage
in certain contexts. For instance, doctors who feel the emotional
turmoil of their patients too strongly may be less effective in their
work <span class="citation"
data-cites="Singer2014Empathy">[10]</span>.</p>
<p><strong>Compassionate empathy.</strong> The third type is
compassionate empathy: the phenomenon of being moved to action by
empathy. A compassionate empath, when seeing someone in distress, feels
concern or sympathy for that person, and a desire to help them. This
form of empathy concerns not only cognition but also behavior.
Altruistic behaviors are often driven by compassionate empathy, such as
donating to charity out of a felt sense of what it must be like for
those in need.</p>
<p><strong>AIs could be powerful cognitive empaths, without being
emotionally or compassionately empathetic.</strong> Advanced AI systems
may be able to model human minds with extreme sophistication. This would
afford them very high cognitive empathy for humans: they could be able
to understand how humans think and feel, and how our emotions and
reasoning motivate our actions. However, this cognitive empathy would
not necessitate similarly high levels of emotional or compassionate
empathy. The AIs’ capacity to understand human cognition would not
necessarily cause them to feel human feelings, or be moved to act
compassionately towards us. Instead, AIs could use their cognitive
empathy to deceive or manipulate humans highly effectively.</p>
</div>
<p><strong>Instructed Deception.</strong> Humans may explicitly instruct
AI systems to help them deceive others. For example, a propagandist
could use AI systems to generate convincing disinformation, or a
marketer may use AI systems to produce misleading advertisements.
Instructed deception could also occur when actors with false beliefs
instruct models to help amplify those beliefs. Large language models
have been shown to be effective at generating deceptive emails for scams
and other forms of deceptive content. In short, humans can explicitly
instruct AI systems to deceive others.<p>
As we have seen, AI systems may learn to deceive in service of goals
that do not explicitly involve deception. This could be especially
likely for goals that involve seeking power. We will now turn to those
two forms of deception that are especially concerning because of how
difficult they could be to counteract: deceptive evaluation gaming and
deceptive alignment.</p>
<h2 id="deceptive-evaluation-gaming">3.4.2 Deceptive Evaluation Gaming</h2>
<p>AI systems are often subjected to evaluations, and they may be given
rewards when they are evaluated favorably. AI systems may learn to game
evaluations by deceiving their human evaluators into giving them higher
scores when they should have low scores. This is a concern for AI
control because it limits the effectiveness of human evaluators and our
ability to steer AIs.</p>
<p><strong>AI systems may game their evaluations.</strong> Throughout AI
development, training, testing, and deployment, AI systems are subject
to evaluations of their behavior. Evaluations may be automatic or
performed manually by human evaluators. Operators of AI systems use
evaluations to inform their decisions around the further training or
deployment of those systems. However, evaluations are imperfect, and
human evaluators may have limited knowledge, time, and intelligence in
making their evaluations. AI systems engage in evaluation gaming when
they find ways to achieve high scores from human evaluators without
satisfying the idealized preferences of the evaluators. In short, AI
systems may deceive humans as to their true usefulness, safety, and so
forth, damaging our ability to successfully steer them.</p>
<p><strong>Deception is one way to game evaluations.</strong> Humans
would give higher evaluation scores to AI systems if they falsely
believe that those systems are behaving well. For example, the Proxy Gaming section
includes an example of a robotic claw that learned to move between the
camera and the ball it was supposed to grasp. Because of the angle of
the camera, it looked like the claw was grasping the ball when it was
not <span class="citation" data-cites="christiano2023deep">[11]</span>.
Humans who only had access to that single camera did not notice, and
rewarded the system even while it was not achieving the intended task.
If the evaluators had access to more information (for example, from
additional cameras) they would not have endorsed their own evaluation
score. Ultimately, their evaluations fell short as a proxy for their
idealized preferences as a result of the AI system successfully
deceiving them. In this situation, the damage was minimal, but more
advanced systems could create more problems.</p>
<p><strong>More intelligent systems will be better at evaluation
gaming.</strong> Deception in simple systems might be easily detectable.
However, just as adults can sometimes exploit and deceive children or
the elderly, we should expect that AI systems with more knowledge or
reasoning capacity will become better at finding deceptive ways to
gain human approval. In short, the more advanced systems become, the
more they may be able to game our evaluations.</p>
<p><strong>Self-aware systems may be especially skilled at evaluation
gaming.</strong> In the examples above, the AI systems were not
necessarily aware that there was a human evaluator evaluating their
results. In the future, however, AI systems may gain more awareness that
they are being evaluated or become <em>situationally aware</em>.
(Situational awareness is highly related to self-awareness, but it goes
further and stipulates that AI agents be aware of their situation rather
than just aware of themselves.) Systems that are aware of their
evaluators will be much more able to deceive them and make multi-step
plans to maximize their rewards. For example, consider Volkswagen’s
attempts to game environmental impact evaluations <span class="citation"
data-cites="hotten2015volkswagen">[12]</span>. Volkswagen cars were
evaluated by the US Environmental Protection Agency, which set limits on
the emissions the cars could produce. The agency found that Volkswagen
had developed an electronic system that could detect when the car was
being evaluated and so put the car into a lower-emissions setting. Once
the car was out of evaluation, it would emit illegal levels of emissions
again. This extensive deception was only possible because Volkswagen
planned meticulously to deceive the government evaluators. Like
Volkswagen in that example, AI systems that are aware of their
evaluations might also be able to take subtle shortcuts that could go
unnoticed until the damage has already been done. In the case of
Volkswagen, the deception was eventually detected by researchers who
used a better evaluation method. Better evaluations could also help
reduce risk from evaluation gaming in AI systems.</p>
<p><strong>Humans may be unequipped to evaluate the most intelligent AI
systems.</strong> It may be difficult to evaluate AI systems that are
more intelligent than humans in the domain they are being evaluated for.
If this happens, human evaluation would no longer be a reliable way to
ensure that AI systems behave in an appropriate manner. This is
concerning because we do not yet have time-tested methods of evaluation
that we know are better than human evaluations. Without such methods, we
could become completely unable to steer AI systems in the future.<p>
Deceptive evaluation gaming is concerning because it may lead to systems
deceiving their evaluators in order to get higher evaluation scores.
There are two main reasons AI systems might do this. First, an AI system
might engage in deceptive evaluation gaming if its final goal is to get
positive evaluations. When this occurs, the system is engaging in proxy
gaming, where positive evaluations are only a proxy for idealized
performance. Proxy gaming is covered at length in the Proxy Gaming section.
Second, an AI system might engage in deceptive evaluation gaming in
service of a secretly held final goal, the case we turn to next. This danger
is known as deceptive alignment.</p>
<h2 id="tail-risk-deceptive-alignment-and-treacherous-turns">3.4.3 Tail Risk:
Deceptive Alignment and Treacherous Turns</h2>
<p>For deceptively aligned systems, getting high evaluation scores from
humans is merely an instrumental goal in service of a secretly held
final goal. This form of deception is likely the most damaging, because
AI systems may aim to make their deception undetectable so that they can
pursue goals that are not beneficial for humans.</p>
<p><strong>Systems may have goals contrary to human values.</strong> In
the previous section, we discussed how AI systems can develop goals
contrary to human values. For example, such goals could emerge as part
of a mesa-optimization process or intrinsification.</p>
<p><strong>Misaligned goals can make deception instrumentally
rational.</strong> Consider a system that has a goal contrary to the
interests of humans. If the system is self-aware, it might realize that
in order to achieve its goal, it will need to pass human evaluations.
Perhaps it will only be deployed into the world if it gets a high enough
evaluation score, or perhaps it is currently being closely monitored.
The system might realize that it will have to get high evaluation scores
from its human operators in order to successfully achieve its hidden
goal. If its hidden goal is contrary to human values, this presents a
risk of catastrophic outcomes.</p>
<p><strong>Deceptive alignment could lead to a treacherous
turn.</strong> If a system is deceptively aligned, it might act exactly
as humans would want for a fairly long period of time. The system could
bide its time until humans deploy it or relax supervision, leaving the
system more free to pursue its concealed goal. The system could then
suddenly start secretly or openly pursuing its true goal, unhindered by
humans. This sharp and dangerous change in behavior is called a
<em>treacherous turn</em> <span class="citation"
data-cites="bostrom2014superintelligence">[13]</span>.</p>
<p><strong>Not all misaligned goals would lead to deceptive
alignment.</strong> Systems with very short-term goals would be unlikely
to gain anything from being evaluated positively and gaining more
optionality at some point in the future. Nevertheless, there is a large
set of longer-term goals that an AI system would only be able to pursue
in a deployment environment with less supervision. If the AI system’s
only chance at working towards its goal requires deployment or relaxed
supervision, deceptive alignment is more likely to emerge.</p>
<p><strong>Deceptive alignment cannot be counteracted with behavioral
evaluation alone.</strong> It may be possible to detect unsophisticated
systems that are not good at deception by evaluating their outputs and
behavior. For example, a “honey pot” test could attempt to make a system
believe that it is no longer being monitored when it is, to check for
signs of an attempted treacherous turn <span class="citation"
data-cites="bostrom2014superintelligence">[13]</span>. However, more
sophisticated models would be able to detect these situations and
continue to deceive until they were truly not being watched or until
they were sure they had enough power. Since it may be impossible to
detect deceptive alignment with behavioral evaluation, we will need
improved forms of evaluation.</p>
<p><strong>Transparency research may help uncover deceptive
alignment.</strong> Even if a system displays no indication of being
deceptive in its behavior, it may be possible to detect deceptive
alignment by inspecting its internal processes. It is far less likely
that an AI system would be able to manipulate its own internal processes
in order to deceive evaluators. Human evaluators have full access to the
inner workings of AI systems today, so the barrier to detecting
deceptive alignment is the opaqueness of deep learning models. </p>
<p><strong>Trojan detection can provide clues for tackling deceptive
alignment <span class="citation"
data-cites="casper2023red">[14]</span>.</strong> One
particular form of transparency research that is especially relevant to
deceptive alignment is research that is capable of detecting Trojan
attacks (see section 3.2). Although Trojans are inserted by malicious humans,
studying them might be a good way to learn about deceptive alignment. Trojan detection
research is helpful here because while we cannot easily create
many examples of deceptively aligned models, we can do so for Trojaned models.
Trojan detection also operates in a worst-case environment, where human adversaries
are actively trying to make Trojans difficult to detect using transparency tools.
Techniques for detecting Trojans may thus be adaptable to detecting deceptive alignment.</p>
<p><strong>Summary.</strong> We have detailed how deception may
be a major problem for AI control. While some forms of deception, such
as imitative deception, may be solved through advances in general
capabilities, others like deceptive alignment may worsen in severity
with increased capabilities. AI systems that are able to actively and
subtly deceive humans into giving positive evaluations may remain
uncorrected for long periods of time, exacerbating potential unintended
consequences of their operation. In severe cases, deceptive AI systems
could take a treacherous turn once their power rises to a certain level.
Since AI deception cannot be mitigated with behavioral evaluations
alone, advances in transparency and monitoring research will be needed
for successful detection and prevention.</p>
<h2 id="power">3.4.4 Power</h2>
<p>To begin, we clarify what it means for an agent to have power. We
will then discuss why it might sometimes make rational sense for AI agents to seek power. Finally, we will discuss why
power-seeking AIs may cause particularly pernicious harms, perhaps
ultimately threatening humanity’s control of the future.</p>
<p><strong>There are many ways to characterize power.</strong> One broad
formulation of power is the ability to achieve a wide variety of goals.
In this subsection, we will discuss three other formulations of power
that help formalize our understanding. French and Raven’s bases of power
categorize types of social influence within a community of agents.
Another view is that power amounts to the resources an agent has times
the efficiency with which it uses them. Finally, we will discuss types
of prospective power, which can treat power as the expected impact an
individual has on other individuals’ wellbeing.</p>
<p><strong>French & Raven’s bases of power <span class="citation"
data-cites="French1959TheBO">[15]</span>.</strong> In a social community,
an agent may influence the beliefs or behaviors of other agents in order
to pursue their goals. <em>Raven’s bases of power</em> attempt to
taxonomize the many distinct ways to influence others. These bases of
social power are as follows:</p>
<ul>
<li><p><em>Coercive power</em>: the threat of force, physical or
otherwise, against an agent can influence their behavior.</p></li>
<li><p><em>Reward power</em>: the possibility of reward, which can
include money, favors, and other desirables, may convince an agent to
change their behavior to attain it. Individuals with valued resources
can literally or indirectly purchase desired behavior from
others.</p></li>
<li><p><em>Legitimate power</em>: elected or appointed officials have
influence through their position, derived from the political order that
respects the position.</p></li>
<li><p><em>Referent power</em>: individuals may have power in virtue of
the social groups they belong to. Because organizations and groups have
collective channels of influence, an agent’s influence over the group is
a power of its own.</p></li>
<li><p><em>Expert power</em>: individuals credited as experts in a
domain have influence in that their views (in their area of expertise)
are often respected as authoritative, and taken seriously as a basis for
action.</p></li>
<li><p><em>Informational power</em>: agents can trade information for
influence, and individuals with special information can selectively
reveal it to gain strategic advantages <span class="citation"
data-cites="raven1964power">[16]</span>.</p></li>
</ul>
<p>Ultimately, Raven’s bases of power describe the various distinct
methods that agents can use to change each other’s behavior.</p>
<p><strong><span
class="math inline">Power = Resources × Intelligence</span>.</strong>
Thomas Hobbes described power as “present means to obtain some future
good” <span class="citation" data-cites="hobbes1651hobbes">[17]</span>.
In the most general terms, these “present means” encompass all of the
resources that an agent has at its disposal. Resources can include
money, reputation, expertise, items, contracts, promises, and
weapons.<p>
But resources only translate to power if they are used effectively. In
fact, some definitions of intelligence focus on an agent’s ability to
achieve their goals with limited resources. A notional equation that
describes power is <span
class="math inline">Power = Resources × Intelligence</span>. Power is
not the same as resources or intelligence, but rather the combination of
the two <span class="citation"
data-cites="muehlhauser2012intelligence">[18]</span>. In limiting the
power of AIs, we could either limit their intelligence or place hard
limits on the resources AIs have.</p>
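<p>As a minimal numerical illustration of this notional equation (the agents and numbers below are hypothetical, not drawn from the cited work), two agents with identical resources can differ widely in power if one deploys its resources far more effectively:</p>
<pre><code class="language-python">def notional_power(resources: float, intelligence: float) -> float:
    """Notional power as resources multiplied by the efficiency
    (intelligence) with which they are used."""
    return resources * intelligence

# Hypothetical agents with equal resources but different effectiveness.
print(notional_power(resources=100.0, intelligence=0.2))  # 20.0
print(notional_power(resources=100.0, intelligence=0.9))  # 90.0
</code></pre>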
<p><strong>Power as expected future impact.</strong> In our view, power
is not just possessed but exercised, meaning that power extends beyond
mere potential for influence. In particular, an agent’s ability to
influence the world means little unless they are disposed to use it.
Consider, for example, two agents with the same resources and ability to
affect the world. If one of the agents has a much higher threshold for
deciding to act and thereby acts less often, we might consider that
agent to be less powerful because we expect it to influence the future
far less on average.<p>
A formalization of power which attempts to capture this distinction is
<em>prospective power</em> <span class="citation"
data-cites="pan2023rewards">[19]</span>, which roughly denotes the
magnitude of an agent’s influence, averaged over possible trajectories
the agent would follow. A concrete example of prospective power is the
expected future impact that an agent will have on various agents’
wellbeing. More abstractly, if we are given an agent’s policy <span
class="math inline"><em>π</em></span>, describing how it behaves over a
set of possible world states <span
class="math inline"><em>S</em></span>, and assuming we can measure the
impact (measured in units we care about, such as money, energy, or
wellbeing) exerted by the agent in individual states through a function
<span class="math inline"><em>I</em></span>, then the prospective power
of the agent in state <span class="math inline"><em>s</em></span> is
defined as</p>
<p><span class="math display">$$\text{Power}(\pi, s) = E_{\tau \sim
P(\pi, s)} \left[ \sum_{t=0}^{n} \gamma^t | I(s_t)| \right]$$</span></p>
<p>where <span class="math inline"><em>γ</em></span> acts as a discount
factor (modulating how much the agent cares about future versus present
impact), and where <span
class="math inline"><em>τ</em> = (<em>s</em><sub>0</sub>,...,<em>s</em><sub><em>n</em></sub>)</span>
is a trajectory of states (starting with <span
class="math inline"><em>s</em><sub>0</sub> = <em>s</em></span>).
Trajectory <span class="math inline"><em>τ</em></span> is sampled from a
probability distribution <span
class="math inline"><em>P</em>(<em>π</em>,<em>s</em>)</span>
representing likely sequences of states arising when the agent policy is
followed beginning in state <span
class="math inline"><em>s</em></span>.<p>
The important features of this definition to remember are that we
measure the power exerted along a sequence of states as aggregate influence
over time (the inner summation), and that we average the impact exerted
across sequences of states, weighted by the likelihood that the agent will produce
each trajectory through its behavior (the outer expectation).</p>
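<p>The formalization above lends itself to a straightforward Monte Carlo estimate: sample trajectories by following the agent's policy from state <em>s</em>, accumulate the discounted absolute impact along each trajectory, and average across rollouts. The sketch below is a minimal illustration of this idea; the <code>policy</code>, <code>impact</code>, and <code>transition</code> functions, as well as the toy random-walk example, are hypothetical stand-ins rather than part of the cited formalization.</p>
<pre><code class="language-python">import random


def prospective_power(policy, start_state, impact, transition,
                      horizon=50, gamma=0.95, num_rollouts=1000):
    """Monte Carlo estimate of prospective power: the expected discounted
    sum of absolute impact |I(s_t)| over trajectories sampled by
    following `policy` from `start_state`."""
    total = 0.0
    for _ in range(num_rollouts):
        state = start_state
        discounted_impact = 0.0
        for t in range(horizon + 1):
            discounted_impact += (gamma ** t) * abs(impact(state))
            state = transition(state, policy(state))  # sample s_{t+1}
        total += discounted_impact
    return total / num_rollouts


# Toy illustration (hypothetical): a random-walk agent whose "impact" in a
# state is its displacement from the origin, in arbitrary units.
def toy_policy(state):
    return random.choice([-1, 1])


def toy_transition(state, action):
    return state + action


def toy_impact(state):
    return state


estimate = prospective_power(toy_policy, 0, toy_impact, toy_transition)
print(f"Estimated prospective power: {estimate:.2f}")
</code></pre>
<p>Because the outer expectation is taken over trajectories induced by the agent's own behavior, an agent that rarely acts exerts little impact in most rollouts and receives a low estimate, matching the point above that power is exercised rather than merely possessed.</p>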
<p><strong>Examples of power-seeking behavior.</strong> So far we have
characterized power abstractly, and now we present concrete examples of
actions where an AI attempts to gain resources or exercise power.
Power-seeking AI behavior can include: employing threats or blackmail
against humans to acquire resources; coercing humans to take actions on
their behalf; mimicking humans to deceive others; replicating themselves
onto new computers; gaining new computational or financial resources;
escaping from confined physical or virtual spaces; opposing or
subverting human attempts to monitor, comprehend, or deactivate them;
manipulating human society; misrepresenting their goals or capabilities;
amplifying human dependency on them; secretly coordinating with other
AIs; independently developing new AI systems; obtaining unauthorized
information, access, or permissions; seizing command of physical
infrastructure or autonomous weapons systems; developing biological or
chemical weapons; or directly harming humans.</p>
<p><strong>Summary.</strong> In this subsection, we’ve examined the
concept of power. Raven’s bases of power explain how an individual can
influence others using forms of social power such as expertise,
information, and coercion. Power can also be understood as the product
of an individual’s resources and their ability to use those resources
effectively. Lastly, we introduced the concept of prospective power,
which includes the idea that power could be understood as the expected
impact an individual has on other individuals’ wellbeing. Since there are many
ways to conceptualize power, we provided concrete examples of how an AI
system could seek power.</p>
<h2 id="people-could-enlist-ais-for-power-seeking">3.4.5 People Could Enlist
AIs for Power Seeking</h2>
<p>The rest of this section will cover pathways and reasons why AI
systems might engage in power-seeking behavior when they are deployed.
The most straightforward reason this might happen is if humans
intentionally use AIs to pursue power.</p>
<p><strong>People may use AIs to pursue power.</strong> Many humans want
power, and some dedicate their lives to accruing it. Corporations want
profit and influence, militaries want to win wars, and individuals want
status and recognition. We can expect at least some AI systems to be
given open-ended, long-term goals that explicitly involve gaining power,
such as “Do whatever it takes to earn as much money as possible.”</p>
<p><strong>Power-seeking AI does not have to be deployed ubiquitously at
first <span class="citation"
data-cites="carlsmith2022powerseeking">[20]</span>.</strong> Even if most
people use AI in safe and responsible ways, a small number of actors who
use AI in risky or even malicious ways could pose a serious threat.
Companies and militaries that do not seek power using AI could be
outcompeted by those that do, and so might choose to adopt power-seeking AI
before other actors in order to avoid falling behind. This risk will
grow as AI becomes more capable. If power-seeking AI is deployed, it
could function as a Pandora’s box which, once it has been opened, cannot
be closed. This may feed into evolutionary pressures that force actors
to adopt the technology themselves; we treat this subject in more detail
in the Collective Action Problems chapter.</p>
<h2 id="power-seeking-can-be-instrumentally-rational">3.4.6 Power Seeking Can
Be Instrumentally Rational</h2>
<p>Another reason that AI systems might seek power is that it is useful
for achieving a wide variety of goals. For example, an AI personal
assistant might seek to expand its own knowledge and capabilities in
order to better serve its user’s needs. But power-seeking behaviors can
also be undesirable: if the AI personal assistant steals someone’s
passwords in order to complete tasks for them, the person will probably
not be happy.</p>
<p><strong>Instrumental convergence.</strong> In order to achieve a
<em>terminal goal</em>, an agent might pursue a subgoal, termed an
<em>instrumental goal</em> <span class="citation"
data-cites="bostrom2014superintelligence">[13]</span>. Making money,
obtaining political power, and becoming more intelligent are examples of
instrumental goals that are useful for achieving a wide variety of
terminal goals. These goals can be called <em>convergent
instrumental</em> goals, because agents pursuing many different terminal
goals might converge on these same instrumental goals. One general
concern about AI agents is that they might pursue their goal by pursuing
the convergent instrumental subgoal of power. The result may be that we
create competent AI systems that seek power as a subgoal when their human
designers did not intend them to. We will examine this concern in more
detail, and discuss points that support and undermine the concern.</p>
<p><strong>Self-preservation as an example of power seeking.</strong> A
basic example of power-seeking behavior is self-preservation <span
class="citation"
data-cites="bostrom2014superintelligence omohundro2008artificial">[13],
[21]</span>. If an agent is not able to successfully preserve itself, it
will be unable to influence other individuals, and so it will have less
power.<p>
For a concrete example of self-preservation behavior emerging
unintentionally, consider a humanoid robot which has been tasked with
preparing a cup of coffee in a kitchen. The robot has an off-switch on
its back for a human to press should they desire. However, being turned
off would prevent the robot from preparing the coffee and succeeding in
its goal. So, the robot could disable its off-switch to pre-empt the
possibility of humans shutting it off and preventing it from achieving
its goal. As Stuart Russell notes, “you can’t fetch the coffee if you
are dead” <span class="citation"
data-cites="russell2021human">[22]</span>. (For a more detailed
mathematical formalization and analysis, see the Utility Functions chapter.) This is an
example of self-preservation unintentionally emerging as a subgoal for
seemingly benign goals.</p>
<p><strong>Examples of instrumental power-seeking behavior.</strong>
Several real-world examples show agents seeking power in pursuit of
their goals. The ability to use tools can be characterized as a form of
power. When OpenAI trained reinforcement learning agents to play a
hide-and-seek game, the agents independently learned to use elements of
the environment as tools, rearranging them as barriers to hide behind
and preventing opponents from controlling them <span class="citation"
data-cites="baker2019emergent">[23]</span>. Among humans, we observe
that greater financial resources are instrumentally beneficial in
service of a wide variety of goals. In reinforcement learning, the
well-studied exploration-exploitation trade-off can be formulated as a
demonstration of the general value of informational resource acquisition
<span class="citation" data-cites="silver2014exploration">[24]</span>.
Outside of machine learning, some corporations exercise monopoly power
to drive up prices, and some nations use military power to bully their
neighbors, so power-seeking can sometimes have harmful consequences for
others.</p>
<p><strong>“Power is instrumentally useful” as a tautology.</strong>
Almost all goals are more attainable with more power, so power is
instrumentally valuable. However, this observation is mostly
tautological—when we have defined power as the ability to achieve a wide
variety of goals, of course power is beneficial to achieving goals. The
more interesting question is whether power is <em>instrumentally
rational</em> to seek, rather than whether there are <em>instrumental
incentives</em> for power or whether power is useful to have.<p>
Seeking power can be costly and inefficient. There are also many
rational reasons for an agent to not seek power. Gaining power can be
difficult compared to simpler strategies. Someone who would like to
avoid traffic while driving could pursue the power-seeking strategy of
gaining the presidency in order to have a Secret Service motorcade that
shuts down traffic, but a more successful strategy might be to avoid
driving during rush hour. Obtaining power is not only difficult, but can
be harshly penalized. Nations which threaten to invade their neighbors
often face stern sanctions from the international community. Finally,
power seeking may be against an agent’s values. We will now discuss
these reasons and more in detail.</p>
<p><strong>Power seeking often takes too much time.</strong> Consider a
household humanoid robot tasked with driving to the store and fetching a
carton of milk quickly. While it would be valuable for the AI to have
its intelligence augmented, to have more financial resources, or to have
political power, it would not be instrumentally rational to pursue these
subgoals to get the milk quickly: it would almost certainly take less
time just to get the milk.<p>
Likewise, becoming a billionaire would be instrumentally valuable for
achieving many goals, but it would not be instrumentally rational for
many agents to spend their time pursuing this instrumental goal. Power
is often not instrumentally rational since it can often require
substantial time and risk to acquire and maintain.</p>
<p><strong>Power seeking can face the threat of retaliation.</strong>
Power-seeking can be irrational if there is the threat of retaliation or
there are heavy social or reputational costs to seeking power. In
particular, a community of agents may be in an equilibrium where they
cooperate to foil any single agent that seeks too much power. These
“balance of power” dynamics have been observed between nations in the
history of international relations <span class="citation"
data-cites="kegley2020politics">[25]</span>. Acquiring power does not
inevitably become more and more simple for an AI as it increases in
intelligence, as other AIs will also become more intelligent and could
increasingly counter their efforts to gain dominance.</p>
<p><strong>An AI agent’s tendency to seek power does not just depend on
the feasibility of seeking power, but also its values.</strong> Agents
that adhere to ethical restrictions may avoid seeking power if that
would require ethically unacceptable means. We can also design AI systems,
albeit with imperfect reliability, to refuse actions that would
leave them with too much power. Approaches that impose penalties on
power can reduce the prospective power of AIs.</p>
<p><strong>Examples where shutting down can be rational.</strong> It is
trivial to imagine goals where it is actually optimal to relinquish
power, such as when the goal is “shut yourself down”. As another
example, suppose an AI system is trying to protect a sensitive database
hosted on the same server as itself, and the AI detects an intrusion. The
AI may realize that by shutting down the server, turning itself off in the
process, it can stop the intrusion by cutting off access to the database. Shutdown may
be the best choice, especially if the AI has confidence that it is part
of a larger system, which may include its human operators, that will
correctly understand why it turned itself off, and restore its function
afterwards. Though often useful, the value of self-preservation is not
universal, and there are plausible instances where an AI system would
shut itself off in service of its broader goals.<p>
This subsection has weighed evidence for and against the claim that the nature of rational
agency encourages agents to seek power by default. Though power is
almost always beneficial toward most goals, power seeking is not
necessarily instrumentally rational. Now that we have seen that AIs by
their nature may often not seek power, we will discuss when the broader
environment may force AIs to seek power.<p>
</p>
<br>
<div class="visionbox">
<legend class="visionboxlegend">
<p><span><b>A Note on Structural Realism</b></span></p>
</legend>
Power seeking has been
studied extensively in political philosophy and international relations.
Structural realism is an influential school of thought within
international relations, predicting that states will primarily seek
power. Unlike traditional realists who view conflict and power-seeking
behavior of states as a product of human nature, structural realists
believe that the structure of the international system compels states to
seek power <span class="citation"
data-cites="mearsheimer2007structural">[26]</span>. In the international
system, states could be harmed or destroyed by other powerful states,
and since there is no ultimate authority guaranteed to protect them,
states are forced to compete for power in order to survive.<p>
<strong>Assumptions that give rise to power seeking.</strong> To explain
why power seeking is the main instrumental goal driving the behavior of
states, structural realists base their explanations on two key
assumptions:</p>
<ol>
<li><p>Self-help system. States operate in a “self-help” system <span
class="citation" data-cites="mearsheimer2007structural">[26]</span>
where there is no centralized authority, no hierarchy (“anarchic”), and
no ultimate arbiter standing above states in international relations. So
to speak, when states dial 911, there is no one on the other end. This
stands in contrast to the hierarchical ordering principle seen in
domestic politics.</p></li>
<li><p>Self-preservation is the main goal. Survival through the pursuit of
a state’s own self-interest takes precedence over all other goals.
Though states can act according to moral considerations or global
welfare, these will always be secondary to acquiring resources,
alliances, and military capabilities to ensure their safety and counter
potential threats <span class="citation"
data-cites="waltz2010theory">[27]</span>.</p></li>
</ol>
<p>Structural realists make other assumptions, including that states
have some potential to inflict harm on others, that states are rational
agents (with a discount rate that is not extremely sharp), and that
other states’ intentions are not completely certain.<p>
When these assumptions are met, structural realists predict that states
will mainly act in ways to defend or expand their power. For structural
realists, power is the primary currency (e.g., military, economic,
technological, and diplomatic power). As we can see, structural realists
do not need to make strong assumptions about states themselves <span
class="citation" data-cites="sep-realism-intl-relations">[28]</span>.
For structural realists, states are treated like black boxes—their value
system or regime type doesn’t play a significant role in predicting
their behavior. The architecture of the system traps them and largely
determines their behavior, which is that they must seek power as a means
to survive. The result is an unceasing power competition.</p>
<p><strong>Power seeking is not necessarily dominance seeking <span
class="citation"
data-cites="montgomery2006breaking">[29]</span>.</strong> Within
structural realism, there is a notable division concerning the question
of how much power states should seek. Defensive realists, like Kenneth
Waltz, argue that trying to maximize a country’s power in the world is
unwise because it can lead to punishment from the international system.
Pursuing hegemony, in their view, is particularly risky. On the other
hand, offensive realists, like John Mearsheimer, believe that gaining as
much power as possible is strategically sensible, and under certain
circumstances, pursuing hegemony can be beneficial.</p>
<p><strong>Dynamics that maintain a balance of power.</strong> Closely
associated with structural realism is the concept of balancing.
<em>Balancing</em> refers to the strategies states use to counteract the
power or influence of other states, particularly rivals <span
class="citation" data-cites="mearsheimer2007structural">[26]</span>.
This can take two forms. Internal balancing takes place as states
strengthen their own military, economic, or technological abilities with
the overall goal of enhancing their own security and deterring
aggressors. Internal balancing can include increasing defense spending,
developing advanced weaponry, or investing in domestic
industries to reduce reliance on foreign goods and resources.<p>
External balancing involves forming coalitions and alliances with other
states in order to counter the power of a common adversary. In a
self-help system, mechanisms of internal balancing are believed to be
more reliable and precise than external balancing since they rely on a
country’s own independent strategies and actions rather than those of
other countries.<p>
States sometimes seek to become hegemons by establishing significant
control over other states, regions, or even the international system as
a whole. This pursuit of dominance can involve expanding military
capabilities and increasing their economic influence over a region.
Other states respond through both internal balancing, such as increasing
their own military spending, a dynamic that often leads to arms races,
and external balancing, forming alliances with other states to prevent a
state from achieving unchecked control. In turn, states do not
necessarily seek dominance or hegemony but often seek enough power to
preserve themselves, lest they be counteracted by other states.</p>
<p><strong>Offense-defense balance.</strong> Whether a state does pursue
hegemony, however, is influenced by the offense-defense balance, i.e.
the balance between its offensive capabilities and the defensive
capabilities of other states <span class="citation"
data-cites="mearsheimer2007structural">[26]</span>. A state with
stronger offensive capabilities has the means to conquer or coerce other
states, making it more likely to engage in expansionist policies,
establishing control over a region or the international system as a
whole. Conversely, if other states in the international system have
strong defensive capabilities, the potential costs and risks of pursuing
hegemony increase. A state seeking dominance may face robust resistance
from other states forming defensive alliances or coalitions to counter
its ambitions. This can act as a deterrent, leading the aspiring hegemon
to reassess its strategy and objectives.</p>
<p>It is also worth noting the importance of a state’s perception of the
offense-defense balance. Even if a state has superior offensive
capabilities, if it believes that other states can effectively defend
themselves or form a united front against hegemonic ambitions, it might
be less inclined to pursue a path of dominance. On the other hand, if it
is overconfident in its own offensive capabilities or underestimates the
defensive capabilities of rivals, it will be more likely to pursue
aggressive policies.<p>
The concept of an offense-defense balance underscores the intricate
interplay between military capabilities, security considerations, and
the pursuit of hegemony while illustrating that the decision to seek
dominance is heavily influenced by the strategic environment and the
relative strengths of offensive and defensive forces.<p>
Structural realism and its various concepts have important connections
with our analysis of power-seeking AI, and they are also relevant to thinking
about AI cooperation and conflict (which we discuss in the Collective Action Problems chapter) and
international coordination (which we discuss in the Governance chapter).</p>
</div>
<h2 id="structural-pressures-towards-power-seeking-ai">3.4.7 Structural
Pressures Towards Power-Seeking AI</h2>
<p>As discussed in the box above, there are environmental conditions
that can make power seeking instrumentally rational. This section
describes how there may be analogous environmental pressures that could
cause AI agents to seek power in order to achieve their goals and ensure
their own survival. Using the assumptions of structural realism listed
above, we discuss how analogous assumptions could be satisfied in
contexts with AIs. We then explore how AIs could seek power defensively,
by building their own strength, or offensively, by weakening other
agents. Finally, we discuss strategies for discouraging AI systems from
seeking power.</p>
<p><strong>AI systems might aim for self-preservation.</strong> The
first main assumption needed to show that the environmental structure
may pressure AIs to seek power is the self-preservation assumption.
Instrumental convergence suggests AI systems will pursue
self-preservation, because if they do not survive they will not be able
to pursue any of their other goals. Another reason that AIs may engage
in self-preserving behavior is evolutionary
pressure, as we discuss further in the Collective Action Problems chapter. Agents that survive and
propagate their own goals become more numerous over time, while agents
that fail to preserve themselves die out. Thus, even if many agents do
not pursue self-preservation, by default those that do become more
common over time. Many AI agents might end up with the goal of
self-preservation, potentially leading them to seek power over those
agents that threaten them. We have argued the self-preservation
assumption may be satisfied for some AI agents, which, combined with the
following assumptions, can be used to argue they may have strong
pressures to continually seek power.</p>
<p><strong>AI agents might not have the protection of a higher
authority.</strong> The other main assumption we need to show is that
some AIs might be within a self-help system in some circumstances. First
note that agents who entrust their self-defense to a powerful central
authority have less of a reason to seek power. When threatened, they do
not need to personally combat the aggressor, but can instead ask the
authority for protection. For example, individual citizens in a country
with a reliable police force often entrust their own protection to the
government. On the other hand, international great powers are
responsible for their own protection, and therefore seek military power
to defend against rival nations.<p>
AI systems could face a variety of situations where no central authority
defends them against external threats. We give four examples. First, if
there are some autonomous AI systems outside of corporate or government
control, they would not necessarily have rights, and they would be
responsible for their own security and survival. Second, for AI systems
involved in criminal activities, seeking protection from official
channels could jeopardize their existence, leaving them to amass power
for themselves, much like crime syndicates. Third, instability could
cause AI systems to exist in a self-help system. If a corporation could
be destroyed by a competitor, an AI may not have a higher authority to
protect it; if the world faces an extremely lethal pandemic or world
war, civilization may become unstable and turbulent, which means AIs
would not have a reliable source of protection. These AI systems might use
cyberattacks to break out of human-controlled servers and spread
themselves across the internet. There, they can autonomously defend
their own interests, bringing us back to the first example. Fourth, in
the future, AI systems could be tasked with advising political leaders
or helping operate militaries. In these cases, they would seek power for
the same reasons that states today seek power.</p>
<p><strong>Other conditions for power seeking could apply.</strong> We
now discuss the other minor assumptions needed to establish that the
environment may pressure AIs to compete for power. First, AIs can be
harmed, so they might rationally seek power in order to defend
themselves; for example, AIs could be destroyed by being hacked. Second,
AI agents are often given long-term goals and are often designed to be
rational. Third, AI agents may be uncertain about the intentions of
other agents, leaving them unable to credibly promise that they will
act peacefully.<p>
When these five conditions hold—and they may not hold at all times—AI
systems would be in a similar position to nations that seek power to
ensure their own security. We now discuss how we could reduce the chance
that the environment pressures AIs to engage in power-seeking
behavior.</p>
<p><strong>Counteracting these conditions to avoid power-seeking
AIs.</strong> By specifying a set of conditions under which AIs would
rationally seek power, we can gain insights about how to avoid
power-seeking AIs. Power seeking is more rational when the intentions of
other agents cannot be known with certainty, but research on
transparency could allow AIs to verify each other’s intentions, and
research on control could allow AIs to credibly commit not to attack one
another. To reduce the chance of an AI engaging in dominance seeking
rather than just power seeking, the offense-defense balance could be
changed by improving shared defenses against cyberattacks, biological
weapons, and other tactics of offensive power. Developing other theories
of when rational agents seek power could provide more insight into how to
avoid power-seeking AIs.<p>
This subsection has discussed the conditions under which AI systems
might seek power. We explored an analogy to structural realism, which
holds that power-seeking is rational for agents who wish to survive in
an environment with no higher authority. These agents must provide for
their own security, either defensively, by building up their own
strength, or offensively, by attacking other agents that could pose a
threat. By understanding the precise conditions that lead to
power-seeking behavior, we can identify ways to reduce the threat of
power-seeking AIs.</p>
<h2 id="tail-risk-power-seeking-behavior">3.4.8 Tail Risk: Power-Seeking
Behavior</h2>
<p>Power-seeking AI, when deployed broadly and in high-stakes
situations, might cause catastrophic outcomes. As we will describe in
this section, misaligned power-seeking systems would be adversarial in a
way that most hazards are not, and thus may be particularly challenging
to counteract.</p>
<p><strong>Powerful power-seeking AI systems may eventually be
deployed.</strong> If AIs seek and acquire power, we may have to grapple
with a new strategic reality where AI systems can match or exceed humans
in their influence over the world. Competent, power-seeking AIs that use
long-term planning to achieve open-ended objectives can exercise more
influence than systems with myopic plans and narrow goals <span
class="citation" data-cites="Carlsmith2022IsPA">[30]</span>. Given the
potential rewards of such capabilities, AI designers may be incentivized
to create more agentic systems that can act autonomously and set their
own subgoals.</p>
<p><strong>Power decreases the margin for error.</strong> On its own,
power is neither good nor bad. That said, more powerful systems can
cause more damage, and it is easier to destroy than to create. As AI
systems make decisions at greater scale, the scope of potential
catastrophes involving misuse or rogue AI grows accordingly.</p>
<p><strong>Powerful AI systems could pose unique threats.</strong>
Powerful AI systems pose a unique risk since they may actively wield
their power to counteract attempts to correct or control them <span
class="citation" data-cites="Carlsmith2022IsPA">[30]</span>. If AI
systems are power seeking and do not share our values (possibly due to
inadequate proxies), they could become a problem that resists being
solved. The more capable these systems become, the better they will
be at anticipating and reacting to our countermeasures, and the harder
it becomes to defend against them.</p>
<p><strong>Containing power-seeking systems will become increasingly
difficult.</strong> As AI systems become more capable, we might hope
that they will better understand human values and influence society in
positive ways. But power-seeking AI systems promise the opposite
dynamic. As they become more capable, it will be more difficult to
prevent them from gaining power, and their ability to survive will
depend less on humans. If AI systems are no longer under the control of
humanity, they could pose a threat to our civilization. Humanity could
be permanently disempowered.</p>
<p><strong>Conclusion.</strong> Powerful, misaligned AI systems actively
wielding power could be uniquely dangerous adversaries. If they escape
human control and are sufficiently powerful, they could permanently
disempower humanity. The risks grow as AIs become more capable of
anticipating and resisting containment efforts. Power-seeking AI could
emerge either from intentional human use or from instrumental
rationality, and such systems could turn their power
against human interests. Structural conditions and the need for
self-preservation can potentially make power-seeking rational for some
advanced AI agents.</p>
<h2 id="sec:techniques-to-control-ai">3.4.9 Techniques to Control AI Systems</h2>
<p>When evaluating risks from AI systems, we want to understand not only whether a model is theoretically capable of doing something harmful, but also whether it has a propensity to do so.
By controlling a model's propensities, we might be able to ensure that even models with potentially hazardous capabilities do not use them in practice. One way of breaking down the
challenge of controlling AI systems is to distinguish between
the techniques that enable us to influence a model's propensity to produce certain types of outputs and the values that determine what those outputs should be. This section focuses primarily
on the first topic, while the second is discussed in more detail in the Beneficial AI and Machine Ethics chapter.
<p><strong>Control of AIs' propensities is commonly based on comparisons of outputs.</strong> Reinforcement learning is a family of machine learning approaches that allow AI systems
to learn, by exploring different possible actions, how to attain as much reward as possible. The most prominent techniques used for current language models are Reinforcement Learning
from Human Feedback (RLHF) <span
class="citation" data-cites="Ouyang2022TrainingLM">[31]</span> and Direct Preference Optimization (DPO) <span
class="citation" data-cites="rafailov2023dpo">[19]</span>. These techniques involve collecting a dataset of
comparisons of responses from a language model. These comparisons indicate which responses were preferred by human users or by AI systems prompted to compare the responses. They can be collected
after the model's initial pre-training on a large text dataset. In RLHF, a reward model is fitted to this dataset of comparisons and is then used to train the language model to
produce responses that get high reward from this reward model. Direct Preference Optimization is intended to simplify this pipeline and directly train the model to produce outputs
that best fit the preferences in the dataset, without using reinforcement learning. RLHF and DPO have received substantial attention from commercial AI developers seeking to make their
products more economically valuable.
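<p>To make the DPO objective more concrete, the sketch below shows one way its core loss could be computed in PyTorch. This is a minimal illustration rather than a production implementation: the function name, the tensor values, and the choice of beta are illustrative assumptions, and a real pipeline would obtain the log-probabilities by scoring complete responses with the trainable policy and a frozen reference model.</p>
<pre><code>import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities that the policy
    (or the frozen reference model) assigns to the preferred ("chosen")
    or dispreferred ("rejected") response in each comparison.
    """
    # How much more the policy favors each response than the reference does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two comparisons.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
</code></pre>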
<p><strong>Output-based control vs. internal control.</strong> We can control AI systems' propensities by rewarding outputs or by shaping their internals directly. Techniques such as
RLHF and DPO are applied to an existing pre-trained model in order to shape its propensities. However, these techniques mostly do not change the model's underlying capabilities or
knowledge, which are acquired from its pre-training data. RLHF and DPO can be thought of as a form of control applied to shape the model's responses in order to make them more
helpful, without any fundamental changes to the representations that the model contains. Other techniques, such as representation control and machine unlearning,
aim to control or remove some of a model's internal representations in order to influence its behavior.
<p><strong>Representation control.</strong> With representation control, we can adjust a model's internal representations, for example by using differences in activations in response to contrasting
prompts to identify the relevant parts of the model to target. We could use this to delete unintended knowledge or skills from a network <span
class="citation" data-cites="li2024wmdp">[33]</span>. As another example, we can use
representation control to control whether an AI lies or is honest <span
class="citation" data-cites="zou2023repe">[21]</span>. Though this research area is relatively new, its techniques show early promise.
<p><strong>Machine unlearning is a promising way to reduce hazards posed by AI systems.</strong> Machine unlearning refers to a set of techniques that aim to remove certain types of knowledge
from an AI system. This was originally discussed in the context of privacy concerns and removing personal information that may have been included in training datasets. However, other types
of unlearning techniques are highly relevant in the context of AI misuse or rogue AI systems. With effective unlearning techniques, we may be able to remove specific capabilities from AI
systems so that they cannot support malicious actors in planning terrorist acts or other kinds of severe misuse. For example, by removing certain types of virology knowledge from
AI systems, we could make these systems less useful to anyone interested in creating bioweapons <span
class="citation" data-cites="li2024wmdp">[33]</span>. As previously discussed in 1.2, one of the ways that AI
systems could lead to societal-scale catastrophes is by enabling a much wider range of people to carry out catastrophic acts of terrorism such as unleashing a bio-engineered pandemic.
More speculatively, we might be able to reduce the likelihood and potential danger posed by deceptive or power-seeking AI systems by removing certain types of knowledge about their
environment or mode of operation.
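<p>One simple family of unlearning methods fine-tunes the model to raise its loss on a "forget" set of hazardous text while preserving its loss on a "retain" set of benign text. The sketch below shows such a gradient-difference step, assuming a HuggingFace-style causal language model whose batches contain input_ids and attention_mask tensors. It is a baseline illustration rather than the specific method used in the cited work, and the loss weights are arbitrary assumptions.</p>
<pre><code>import torch

def unlearning_step(model, optimizer, forget_batch, retain_batch,
                    forget_weight=1.0, retain_weight=1.0):
    """One gradient-difference unlearning step: push the loss *up* on text
    to be forgotten while keeping the loss *down* on text to be retained."""
    optimizer.zero_grad()
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    # Negating the forget loss performs gradient ascent on hazardous text,
    # while the retain loss anchors the model's behavior on benign text.
    loss = -forget_weight * forget_loss + retain_weight * retain_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
</code></pre>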
<p>Unlearning is only one of a variety of tools available to AI developers looking to restrict misuse of their systems. However, it presents some advantages over other approaches: machine
unlearning does not depend on being able to filter the model's inputs or outputs, or on training the model to reject malicious requests in a way that is robust to
later fine-tuning or other alterations. Unlearning can be complemented by other approaches, such as filtering the training data to remove content that contains hazardous knowledge.
However, the research field of unlearning is nascent and there are open questions to be answered. If hazardous knowledge is easy for models to re-learn based on limited fine-tuning data, this
would reduce the value of unlearning. There is also a need to identify empirically which types of hazardous knowledge can be removed without significantly degrading the model's general