<style type="text/css">
table.tableLayout{
margin: auto;
border: 1px solid;
border-collapse: collapse;
border-spacing: 1px;
}
table.tableLayout tr{
border: 1px solid;
border-collapse: collapse;
padding: 5px;
}
table.tableLayout th{
border: 1px solid;
border-collapse: collapse;
padding: 3px;
}
table.tableLayout td{
border: 1px solid;
padding: 5px;
}
</style>
<h1 id="conclusion">3.7 Conclusion</h1>
<p>In this chapter, we discussed several key themes: we do not know how
to instill our values robustly in individual AI systems, we are unable
to predict future AI systems, and we cannot reliably evaluate AI
systems. We now recap each in turn.</p>
<p><strong>We do not know how to instill our values robustly in
individual AI systems.</strong> It is difficult to perfectly capture our
idealized goals in the proxies we use to train AIs. The problem begins
with the learning setup. In gathering data to form a training set and in
choosing quantitative proxies to optimize, we typically have to make
compromises that can introduce bias and perpetuate harms or leave room
for proxy gaming and adversarial exploits.</p>
<p>In particular, our proxies may be too simple, or the systems we use to
supervise AIs may run into practical and physical limitations. As a
result, there is a gap between proxies and idealized goals that
optimizers can exploit or adversarially optimize against. If proxies end
up diverging considerably from our idealized goals, we may end up with
capable AI systems optimizing for goals contrary to human values.</p>
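<p>As a toy illustration of this gap, the sketch below performs gradient
ascent on an invented proxy reward that only partly tracks an invented
"true" reward. Both reward functions and all quantities are made up for
illustration; early steps improve both objectives, but continued
optimization of the proxy eventually drives the true objective down.</p>
<pre><code># Toy proxy gaming: optimizing an imperfect proxy eventually degrades
# the true objective. The reward functions are invented for illustration.

def true_reward(a):
    # Idealized goal: best at a = 1.0.
    return -(a - 1.0) ** 2

def proxy_reward(a):
    # Imperfect proxy: mostly tracks the true goal, but contains an
    # exploitable term that rewards pushing `a` ever higher.
    return true_reward(a) + 3.0 * a

def grad(f, a, h=1e-5):
    # Simple numerical derivative.
    return (f(a + h) - f(a - h)) / (2 * h)

a = 0.0
for step in range(40):
    a += 0.05 * grad(proxy_reward, a)   # optimize the proxy, not the goal
    if step % 5 == 0:
        print(f"step {step:2d}  a={a:4.2f}  "
              f"proxy={proxy_reward(a):5.2f}  true={true_reward(a):5.2f}")
</code></pre>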
<p><strong>We are unable to predict future AI systems.</strong> Emergent
capabilities and behaviors mean that we cannot reliably predict the
properties of future AI systems or even current AI systems. AIs can
suddenly and dramatically improve on specific tasks without much
warning, which leaves us little time to react if those changes are
harmful.</p>
<p>Even when we are able to robustly specify our idealized goals, processes
including mesa-optimization and intrinsification mean that trained AI
systems can end up with emergent goals that conflict with these
specifications. AI systems may end up operationally pursuing goals
different to the ones we gave them and thus change their behaviors
systematically over long time periods. This is further complicated by
emergent social dynamics that arise from interacting AI systems
(explored further in the chapter on Collective Action Problems).</p>
<p><strong>We cannot reliably evaluate AI systems.</strong> AI systems
may learn and be incentivized to deceive humans and other AIs, in which
case their behavior stops being a reliable signal for their future
behavior. In the limiting case, AI systems may cooperate until exceeding
some threshold of power or capabilities, after which they defect and
execute a treacherous turn. This might be less of a problem if we
understood how AI systems’ internals work, but we currently lack the
thorough knowledge we would need to break open these “black boxes” and
look inside.</p>
<p>This chapter has touched on a number of research areas in machine
learning and other subjects that may be useful for addressing risks from
AI. These are summarised in the table below, which is not intended to be
an exhaustive list.</p>
<div>
<table class="tableLayout">
<thead>
<tr>
<th style="text-align: center;"><b>Area</b></th>
<th style="text-align: center;"><b>Description</b></th>
</tr>
</thead>
<tr>
<td style="text-align: center;">Transparency</td>
<td style="text-align: left;">
<ul>
<li>This involves better monitoring and control of the inner workings of systems based on deep learning</li>
<li>Both bottom-up approaches (i.e., mechanistic interpretability) and top-down approaches (i.e., representation engineering) are valuable to explore</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align: center;">Trojan <br> Detection </td>
<td style="text-align: left;">
<ul>
<li>Machine learning systems may contain hidden, controllable vulnerabilities known as
“backdoors” or “Trojans”. Backdoored models behave correctly and benignly in almost all scenarios,
but in particular circumstances chosen by an adversary, they have been taught to
behave incorrectly (a minimal sketch of such a backdoor follows this table)</li>
<li>Improving our ability to detect these backdoors is directly useful for preventing misuse,
and can also serve as a useful testbed for identifying risks of loss of control over AI
systems in specific circumstances, such as a “treacherous turn”</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align: center;">Evaluation <br> of <br> Hazardous <br> Capabilities </td>
<td style="text-align: left;">
<ul>
<li>AI systems may exhibit hazardous capabilities in domains such as cyber-attacks or bio-engineering. Capabilities may emerge in ways that we cannot predict as
AI systems are scaled up or otherwise improved. There are a variety of challenges to effective evaluation of these capabilities, including the generality and large "surface area" of many pre-trained models,
a high degree of sensitivity to the details of how prompts and evaluations are constructed, and limited available benchmarks and techniques for assessing certain capabilities</li>
<li>Evaluations can support pre-deployment assessment and ongoing monitoring of risks from AI systems, including risks from hazardous capabilities. Ideally, evaluations
would not only detect existing hazardous capabilities, but also make it easier to track and predict the progress of models' capabilities in relevant domains and skills, and be suitable
not just for one-off tests but for ongoing monitoring of systems during both training and deployment. It will also be important to develop ways to test the validity of evaluations to
avoid false negatives</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align: center;">Anomaly <br> Detection </td>
<td style="text-align: left;">
<ul>
<li>This area is about detecting potential novel hazards such as unknown unknowns,
unexpected rare events, or emergent phenomena. There are numerous relevant hazards
from AI and other sources that anomaly detection could possibly identify more reliably
or identify earlier, including proxy gaming, rogue or unethical AI systems, deception
from AI systems, Trojan models, and malicious use</li>
<li>Anomaly detection can allow models to flag salient anomalies for human review or
to execute a conservative fallback policy (a minimal sketch of one simple detector follows this table). A successful anomaly detector could serve as an
AI watchdog that could reliably detect and triage rogue AI threats. Anomaly detectors
could also be helpful for detecting other threats such as novel biological hazards
</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align: center;">Adversarial <br> Robustness </td>
<td style="text-align: left;">
<ul>
<li>Increasing the robustness of AI systems can help to prevent misuse. It can also reduce
risks of proxy gaming by making models more robust against optimizers (an example attack is sketched after this table)</li>
<li>While robustness is a well-developed field in some parts of machine learning such as
computer vision, new areas to explore include adversarial attacks against LLMs and multimodal agents</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align: center;">Machine <br> Unlearning </td>
<td style="text-align: left;">
<ul>
<li>Deliberately removing specific capabilities from an AI system could be useful for ensuring that broadly accessible AI systems do not have the
know-how to support serious misuse, such as creating bioweapons or conducting cyber-attacks</li>
<li>Unlearning could be particularly helpful for models fine-tuned through APIs:
hazardous capabilities could be scrubbed from a fine-tuned model before users are able to query it</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align: center;">Machine <br> Ethics </td>
<td style="text-align: left;">
<ul>
<li>Machine ethics aims at creating actionable ethical objectives for systems to pursue, and
improving the tradeoff between ethical behavior and practical performance. If advanced
AIs are given objectives that are poorly specified, they could pursue undesirable
actions and behave unethically. If these advanced AIs are sufficiently powerful, these
misspecifications could lead the AIs to create a future that we would strongly dislike</li>
<li>Ambitious research goals in this area include building models with relevant abilities such
as detecting situations where moral principles apply, assessing how to apply those moral
principles, evaluating the moral worth of candidate actions, selecting and carrying out
actions appropriate for the context, monitoring the success or failure of those actions,
and adjusting responses accordingly</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align: center;">Systemic <br> Safety </td>
<td style="text-align: left;">
<ul>
<li>This theme covers a variety of topics that do not improve AI models themselves but
help to make societies more robust against harms that AI systems might cause</li>
<li>Relevant research topics include deep learning for cyberdefense (e.g., deep
learning for intrusion detection), detecting anomalous biological phenomena (e.g.,
wastewater detection of pandemic pathogens), and proof-of-humanity mechanisms (e.g.,
watermarks)</li>
</ul>
</td>
</tr>
</table>
</div>
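<p>To make the notion of a backdoored model concrete, the following minimal
sketch (using stand-in random arrays rather than a real dataset) shows how
an adversary could implant a backdoor by stamping a small trigger patch
onto a fraction of the training images and relabelling them. Trojan
detection research aims to reveal this kind of hidden behavior in a trained
model without knowing the trigger in advance.</p>
<pre><code># Toy data poisoning: stamp a small trigger patch on a fraction of the
# training images and relabel them, implanting a "backdoor" that a model
# trained on this data would learn. Illustrative only; data is random.

import numpy as np

rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))       # stand-in training images
labels = rng.integers(0, 10, size=1000)   # stand-in labels

TARGET_LABEL = 7
POISON_FRACTION = 0.01

n_poison = int(POISON_FRACTION * len(images))
poison_idx = rng.choice(len(images), size=n_poison, replace=False)

images[poison_idx, -4:, -4:] = 1.0        # 4x4 white patch in the corner
labels[poison_idx] = TARGET_LABEL         # adversary's chosen target class

# A classifier trained on (images, labels) would behave normally on clean
# inputs but predict TARGET_LABEL whenever the patch appears; Trojan
# detection aims to reveal this hidden rule without knowing the trigger.
</code></pre>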
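<p>As one deliberately simple illustration of anomaly detection, the sketch
below scores inputs with the maximum softmax probability baseline and flags
low-confidence inputs for human review. It assumes a generic trained
PyTorch classifier <code>model</code> (not defined here) and is a minimal
sketch rather than a production detector.</p>
<pre><code># Minimal out-of-distribution / anomaly scoring with the maximum softmax
# probability (MSP) baseline: inputs with low peak confidence are flagged.
# `model` is assumed to be any trained classifier returning logits.

import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_scores(model, x):
    """Return the peak softmax probability for each input in the batch."""
    logits = model(x)
    return F.softmax(logits, dim=-1).max(dim=-1).values

def flag_anomalies(model, x, threshold=0.5):
    """Mark inputs whose confidence falls below the threshold."""
    scores = msp_scores(model, x)
    return scores.lt(threshold)   # True: send to human review or fallback
</code></pre>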
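<p>To give a flavour of the attacks that adversarial robustness research
defends against, the sketch below implements the classic fast gradient sign
method (FGSM). Again, <code>model</code> is an assumed, generic
differentiable classifier, and the code is illustrative rather than a
hardened attack implementation.</p>
<pre><code># Fast gradient sign method (FGSM): perturb an input in the direction that
# most increases the loss, within an epsilon-ball in the L-infinity norm.
# `model` is assumed to be any differentiable classifier returning logits.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8 / 255):
    """Return an adversarially perturbed copy of `x` (pixels in [0, 1])."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()   # step along the loss gradient
        x_adv = x_adv.clamp(0.0, 1.0)             # keep valid pixel values
    return x_adv.detach()
</code></pre>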
<p>In conclusion, single-agent systems present significant and
hard-to-mitigate threats that could result in catastrophic risks—even
before considering the misuse, multi-agent systems, and arms race
dynamics that we discuss in subsequent chapters.</p>