<h1 id="safety-and-general-capabilities">3.6 Safety and General
Capabilities</h1>
<p>While more capable AI systems can be more reliable, they can also be
more dangerous. Often, though not always, safety and general
capabilities are hard to disentangle. Because of this interdependence,
researchers aiming to make AI systems safer must take care that their
work does not inadvertently increase risks by making AI systems more
powerful.</p>
<p><strong>General capabilities.</strong> Research developments are
interconnected. Though researchers may work on specialized, delimited
problems, their discoveries can have far wider-ranging effects. For
example, work that improves the accuracy of image classifiers on the
ImageNet benchmark often has downstream effects on other image tasks,
including image segmentation and video understanding. Scaling language
models so that they become better at next-token prediction on a
pre-training dataset also improves performance on tasks like question
answering and code completion. We refer to these kinds of capabilities
as general capabilities, because they are good indicators of how well a
model can perform across a wide range of tasks. Examples of general
capabilities include data compression, executing instructions,
reasoning, planning, researching, optimization, sequential decision
making, recursive self-improvement, using the internet, and so on.</p>
<p><strong>General capabilities have a mixed effect on safety.</strong>
Systems that are more generally capable tend to make fewer mistakes. If
the consequences of failure are dire, then advancing capabilities can
reduce risk factors. However, as we discuss in the Safety Engineering chapter, safety is an
emergent property and cannot be reduced to a collection of metrics.
Improving general capabilities may remove some hazards, but it does not
necessarily make a system safer. For example, more accurate image
classifiers make fewer errors, and systems that are better at planning
are less likely to generate plans that fail or are infeasible. More
capable language models are better able to avoid giving harmful or
unhelpful answers. When mistakes are harmful, more generally capable
models may be safer. On the other hand, systems that are more generally
capable can be more dangerous and exacerbate control problems. For
example, AI systems with better reasoning capacity could be better able
to deceive humans, and AI systems that are better at optimizing proxies
may be better at gaming those metrics. As a result, improvements in
general capabilities may be detrimental to safety overall and hasten the
onset of catastrophic risks.</p>
<p><strong>Research on general capabilities is not the best way to
improve safety.</strong> The fact that there can be correlations between
safety and general capabilities does not mean that the best way to
improve safety overall is to improve general capabilities. If
improvements in general capabilities were the only thing necessary for
adequate safety, then there would be no need for safety-specific
research. Unfortunately, there are many risks, such as deceptive
alignment, adversarial attacks, and Trojan attacks, which do not
decrease or vanish with additional scaling. Consequently, targeted
safety research is necessary.</p>
<p><strong>Safety research can produce general capabilities
externalities.</strong> In some cases, research aimed at improving the
safety of models can increase general capabilities, which potentially
hastens the onset of new hazards. For example, reinforcement learning
from human feedback was originally developed to improve the safety of AI
systems, but it also had the effect of making large language models more
capable at completing various tasks specified by a user, indicating an
improvement in general capabilities. Models trained to access external
databases to reduce the risk that they output incorrect information may
gain more knowledge or be able to reason over longer time windows than
before, which results in an improvement in general capabilities.
Research that greatly increases general capabilities is said to have
high general <em>capabilities externalities</em>.</p>
<p><strong>Capabilities are highly correlated.</strong> Large language
model accuracies across diverse topics, such as philosophy,
mathematics, and biology, are very strongly correlated (<span
class="math inline"><em>r</em> > 0.8</span>) <span class="citation"
data-cites="Ilic2023UnveilingTG">[1]</span>. The strong correlation
between different capabilities is one reason that separating safety
metrics from capabilities is challenging: most capabilities are strongly
correlated with general capabilities. It is safe to assume that, in most
cases, a robust improvement in an important capability is accompanied
by improvements in general capabilities.</p>
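<p>To make this concrete, the following sketch illustrates, on purely
synthetic data, the kind of psychometric analysis behind such claims:
pairwise correlations between per-topic accuracies and the share of
variance captured by a single shared factor. The topic count, sample
size, and factor structure here are illustrative assumptions, not
results from [1].</p>
<pre><code class="python"># Minimal sketch with synthetic data (assumptions, not results from [1]):
# rows are models, columns are per-topic accuracy scores.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_topics = 50, 5
general = rng.normal(size=(n_models, 1))              # shared "general" factor
noise = 0.3 * rng.normal(size=(n_models, n_topics))   # topic-specific variation
accuracies = general + noise                          # hypothetical scores

# Pairwise Pearson correlations between topics (columns).
corr = np.corrcoef(accuracies, rowvar=False)
print("pairwise correlations:\n", np.round(corr, 2))

# Share of variance explained by the first principal component,
# a rough proxy for a single "general capability" factor.
eigvals = np.linalg.eigvalsh(np.cov(accuracies, rowvar=False))[::-1]
print("variance explained by first component:", round(eigvals[0] / eigvals.sum(), 2))
</code></pre>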
<p><strong>It is possible to disentangle safety from general
capabilities.</strong> There are examples of safety being disentangled
from general capabilities, so they are not inextricably bound. Many
years of research on the adversarial robustness of image classifiers
have improved many different kinds of adversarial robustness without any
corresponding improvements in overall accuracy. An intuition for this is
that being robust to rare poisoned examples does not reliably correspond
to an increase in general intelligence, much as being more resilient to
toxins does not help a person get better grades or perform everyday
tasks better. Likewise, improvements in transparency, anomaly
detection, and Trojan detection have a track record of not improving
general capabilities. To determine how a research goal affects general
capabilities, it is important to empirically measure how a method
affects safety metrics and capabilities metrics, as its impact is often
not obvious beforehand.</p>
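<p>As a hedged illustration of such a measurement, the sketch below
reports a safety-relevant metric and a general-capabilities metric side
by side, so that a method's capabilities externality is visible rather
than hidden. The evaluation functions and metric choices are
hypothetical placeholders, not a prescribed protocol.</p>
<pre><code class="python"># Hypothetical sketch: report a safety metric and a capabilities metric
# side by side for a baseline model and a model trained with a candidate
# safety method. The evaluation functions are placeholders.
from typing import Callable, Dict

def evaluate(model, safety_eval: Callable, capability_eval: Callable) -> Dict[str, float]:
    """Return both metrics so capabilities externalities are not overlooked."""
    return {
        "safety": safety_eval(model),            # e.g. adversarial accuracy
        "capabilities": capability_eval(model),  # e.g. clean accuracy
    }

def compare(baseline_scores: Dict[str, float], method_scores: Dict[str, float]) -> Dict[str, float]:
    """Change in each metric attributable to the candidate method."""
    return {k: method_scores[k] - baseline_scores[k] for k in baseline_scores}

# Usage (with hypothetical models and evaluators):
# deltas = compare(evaluate(baseline, adv_acc, clean_acc),
#                  evaluate(defended, adv_acc, clean_acc))
# A large gain in deltas["capabilities"] alongside a safety gain would
# indicate a capabilities externality worth weighing in a risk assessment.
</code></pre>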
<p><strong>Researchers should avoid general capabilities
externalities.</strong> Because safety and general capabilities are
interconnected, it is wholly insufficient to argue that one’s research
reduces or eliminates a particular hazard. Many general capabilities
reduce particular hazards. Rather, a holistic risk assessment is
necessary, which requires incorporating empirical estimates of how much
the line of research increases general capabilities. Research
should aim to differentially improve safety; that is, to reduce the overall
level of risk compared to the most likely alternatives. This is called
<em>differential technological development</em>: trying to speed
up the development of safety features and slow down the development of
more dangerous features.</p>
<p>Overall, AI research can and should be directed towards goals that enhance the
safety of AI systems and increase societal resilience to risks from AI. Naive
AI safety research may inadvertently increase some risks even while reducing others.
While this section covered the risk of accelerating general capabilities, the more
general lesson is that research on safety may have unintended consequences
and could even paradoxically reduce safety. Researchers should guard against this by
empirically assessing the impact of their work on multiple risk sources, not just the
one they aim to target. As discussed in the previous sections, there are many research
areas that can improve the safety of AI systems or provide society with defenses against
some AI risks while avoiding dangerous acceleration of AI's general capabilities.
Boosting research in these areas could provide a robust way to reduce risks from AI while avoiding some of the pitfalls described here.</p>
<br>
<br>
<h3>References</h3>
<div id="refs" class="references csl-bib-body" data-entry-spacing="0"
role="list">
<div id="ref-Ilic2023UnveilingTG" class="csl-entry" role="listitem">
<div class="csl-left-margin">[1] D.
Ili’c, <span>“Unveiling the general intelligence factor in language
models: A psychometric approach,”</span> <em>ArXiv</em>, vol.
abs/2310.11616, 2023.</div>
</div>
</div>