<style>
.storybox{
border-radius: 15px;
border: 2px solid gray;
background-color: lightgray;
text-align: left;
padding: 10px;
}
</style>
<style>
.storyboxlegend{
border-bottom-style: solid;
border-bottom-color: gray;
border-bottom-width: 3px;
margin-left: -12px;
margin-right: -12px; margin-top: -13px;
padding: 0.2em 1em; color: #ffffff;
background-color: gray;
border-radius: 15px 15px 0px 0px}
</style>
</head>
<body>
<h1 id="sec:organizational">1.4 Organizational Risks</h1>
<p>In January 1986, tens of millions of people tuned in to watch the
launch of the Challenger Space Shuttle. Approximately 73 seconds after
liftoff, the shuttle exploded, resulting in the deaths of everyone on
board. The loss was tragic enough on its own, but one of the crew members
was a schoolteacher named Sharon Christa McAuliffe. McAuliffe was selected
from over 10,000 applicants for the NASA Teacher in Space Project and
was scheduled to become the first teacher to fly in space. As a result,
millions of those watching were schoolchildren. NASA had the best
scientists and engineers in the world, and if there was ever a mission
NASA didn’t want to go wrong, it was this one <span class="citation"
data-cites="uri_35_2021">[1]</span>.<p/>
The Challenger disaster, alongside other catastrophes, serves as a
chilling reminder that even with the best expertise and intentions,
accidents can still occur. As we progress in developing advanced AI
systems, it is crucial to remember that these systems are not immune to
catastrophic accidents. An essential factor in preventing accidents and
maintaining low levels of risk lies in the organizations responsible for
these technologies. In this section, we discuss how organizational
safety plays a critical role in the safety of AI systems. First, we
discuss how even without competitive pressures or malicious actors,
accidents can happen—in fact, they are inevitable. We then discuss how
improving organizational factors can reduce the likelihood of AI
catastrophes.</p>
<p><strong>Catastrophes occur even when competitive pressures are
low.</strong> Even in the absence of competitive pressures or malicious
actors, factors like human error or unforeseen circumstances can still
bring about catastrophe. The Challenger disaster illustrates that
organizational negligence can lead to loss of life, even when there is
no urgent need to compete or outperform rivals. By January 1986, the
space race between the US and USSR had largely diminished, yet the
tragic event still happened due to errors in judgment and insufficient
safety precautions.<p/>
Similarly, the Chernobyl nuclear disaster in April 1986 highlights how
catastrophic accidents can occur in the absence of external pressures.
As a state-run project without the pressures of international
competition, the disaster happened when a safety test involving the
reactor’s cooling system was mishandled by an inadequately prepared
night shift crew. This led to an unstable reactor core, causing
explosions and the release of radioactive particles that contaminated
large swathes of Europe <span class="citation"
data-cites="iaea1992chernobyl">[2]</span>. Seven years earlier, America
came close to experiencing its own Chernobyl when, in March 1979, a
partial meltdown occurred at the Three Mile Island nuclear power plant.
Though less catastrophic than Chernobyl, both events highlight how even
with extensive safety measures in place and few outside influences,
catastrophic accidents can still occur.<p/>
Another example of a costly lesson on organizational safety came just
one month after the accident at Three Mile Island. In April 1979, spores
of <em>Bacillus anthracis</em>—or simply “anthrax,” as it is commonly
known—were accidentally released from a Soviet military research
facility in the city of Sverdlovsk. This led to an outbreak of anthrax
that resulted in at least 66 confirmed deaths <span class="citation"
data-cites="Meselson1994TheSA">[3]</span>. Investigations into the
incident revealed that the cause of the release was a procedural failure
and poor maintenance of the facility’s biosecurity systems, despite
being operated by the state and not subjected to significant competitive
pressures.<p/>
The unsettling reality is that AI is far less understood than nuclear
technology and rocketry, and AI industry standards are far less stringent.
Nuclear reactors are based on solid, well-established and
well-understood theoretical principles. The engineering behind them is
informed by that theory, and components are stress-tested to the
extreme. Nonetheless, nuclear accidents still happen. In contrast, AI
lacks a comprehensive theoretical understanding, and its inner workings
remain a mystery even to those who create it. This presents an added
challenge of controlling and ensuring the safety of a technology that we
do not yet fully comprehend.</p>
<p><strong>AI accidents could be catastrophic.</strong> Accidents in AI
development could have devastating consequences. For example, imagine an
organization unintentionally introduces a critical bug in an AI system
designed to accomplish a specific task, such as helping a company
improve its services. This bug could drastically alter the AI’s
behavior, leading to unintended and harmful outcomes. One historical
example of such a case occurred when researchers at OpenAI were
attempting to train an AI system to generate helpful, uplifting
responses. During a code cleanup, the researchers mistakenly flipped the
sign of the reward used to train the AI <span class="citation"
data-cites="ziegler2019fine">[4]</span>. As a result, instead of
generating helpful content, the AI began producing hate-filled and
sexually explicit text overnight without being halted. Accidents could
also involve the unintentional release of a dangerous, weaponized, or
lethal AI system. Since AIs can be easily duplicated with a simple
copy-paste, a leak or hack could quickly spread the AI system beyond the
original developers’ control. Once the AI system becomes publicly
available, it would be nearly impossible to put the genie back in the
bottle.<p/>
Gain-of-function research could potentially lead to accidents by pushing
the boundaries of an AI system’s destructive capabilities. In these
situations, researchers might intentionally train an AI system to be
harmful or dangerous in order to understand its limitations and assess
possible risks. While this can lead to useful insights into the risks
posed by a given AI system, future gain-of-function research on advanced
AIs might uncover capabilities significantly worse than anticipated,
creating a serious threat that is challenging to mitigate or control. As
with viral gain-of-function research, pursuing AI gain-of-function
research may only be prudent when conducted with strict safety
procedures, oversight, and a commitment to responsible information
sharing. These examples illustrate how AI accidents could be
catastrophic and emphasize the crucial role that organizations
developing these systems play in preventing such accidents.</p>
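<p>The reward sign flip described above can be illustrated with a toy
optimization loop. This is only a minimal sketch under stated assumptions:
the quadratic reward, the scalar parameter <code>x</code>, the learning
rate, and the <code>reward_improved</code> check are all hypothetical and
are not taken from the incident itself. It shows how a single negated term
turns an optimizer that seeks the intended behavior into one that moves
away from it, and how a simple automated check could flag the
regression.</p>

```python
def reward(x: float) -> float:
    """Toy reward (hypothetical): highest when x equals the intended target 3.0."""
    return -(x - 3.0) ** 2


def reward_gradient(x: float) -> float:
    """Derivative of the toy reward with respect to x."""
    return -2.0 * (x - 3.0)


def train(steps: int, lr: float, sign: float = 1.0, x0: float = 0.0) -> float:
    """Gradient ascent on sign * reward.

    sign=1.0 is the intended objective; sign=-1.0 models an accidental flip.
    """
    x = x0
    for _ in range(steps):
        x += lr * sign * reward_gradient(x)
    return x


def reward_improved(x_before: float, x_after: float) -> bool:
    """Hypothetical guard: did training leave the true reward no worse?"""
    return reward(x_after) >= reward(x_before)


intended = train(steps=50, lr=0.1, sign=1.0)
flipped = train(steps=50, lr=0.1, sign=-1.0)

print(f"intended run: x={intended:.3f}, reward={reward(intended):.3f}")
print(f"flipped run:  x={flipped:.3f}, reward={reward(flipped):.3f}")
print("guard passes for intended run:", reward_improved(0.0, intended))
print("guard passes for flipped run: ", reward_improved(0.0, flipped))
```

<p>In this sketch the guard compares the true reward before and after
training. A check of this kind, run automatically during training, is one
way an unattended run might be halted rather than left to continue
overnight.</p>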
<h2 id="accidents-are-hard-to-avoid">1.4.1 Accidents Are Hard to Avoid</h2>
<p><strong>When dealing with complex systems, the focus needs to be
placed on ensuring accidents don’t cascade into catastrophes.</strong>
In his book “<em>Normal Accidents: Living with High-Risk
Technologies</em>,” sociologist Charles Perrow argues that accidents are
inevitable and even “normal” in complex systems, as they are not merely
caused by human errors but also by the complexity of the systems
themselves <span class="citation"
data-cites="perrow1984normal">[5]</span>. In particular, such accidents
are likely to occur when the intricate interactions between components
cannot be completely planned or foreseen. For example, in the Three Mile
Island accident, a contributing factor to the lack of situational
awareness by the reactor’s operators was the presence of a yellow
maintenance tag, which covered valve position lights in the emergency
feedwater lines <span class="citation"
data-cites="Rogovin1980ThreeMI">[6]</span>. This prevented operators
from noticing that a critical valve was closed, demonstrating the
unintended consequences that can arise from seemingly minor interactions
within complex systems.<p/>
Unlike nuclear reactors, which are relatively well-understood despite
their complexity, complete technical knowledge of most complex systems
is often nonexistent. This is especially true of deep learning systems,
for which the inner workings are exceedingly difficult to understand,
and where the reason why certain design choices work can be hard to
understand even in hindsight. Furthermore, unlike components in other
industries, such as gas tanks, which are highly reliable, deep learning
systems are neither perfectly accurate nor highly reliable. Thus, the
focus for organizations dealing with complex systems, especially deep
learning systems, should not be solely on eliminating accidents, but
rather on ensuring that accidents do not cascade into catastrophes.</p>
<p><strong>Accidents are hard to avoid because of sudden, unpredictable
developments.</strong> Scientists, inventors, and experts often
significantly underestimate the time it takes for a groundbreaking
technological advancement to become a reality. The Wright brothers
famously claimed that powered flight was fifty years away, just two
years before they achieved it. Lord Rutherford, a prominent physicist
and the father of nuclear physics, dismissed the idea of extracting
energy from atomic transformations as “moonshine,” only for Leo Szilard to
conceive of the nuclear chain reaction less than 24 hours later. Similarly,
Enrico Fermi expressed 90 percent confidence in 1939 that it was
impossible to use uranium to sustain a fission chain reaction—yet, just
four years later he was personally overseeing the first reactor <span
class="citation" data-cites="rhodes1986making">[7]</span>.<p/>
AI development could catch us off guard too. In fact, it often does. The
defeat of Lee Sedol by AlphaGo in 2016 came as a surprise to many
experts, as it was widely believed that achieving such a feat would
still require many more years of development. More recently, large
language models such as GPT-4 have demonstrated spontaneously emergent
capabilities <span class="citation"
data-cites="Bubeck2023SparksOA">[8]</span>. On existing tasks, their
performance is hard to predict in advance, often jumping up without
warning as more resources are dedicated to training them. Furthermore,
they often exhibit astonishing new abilities that no one had previously
anticipated, such as the capacity for multi-step reasoning and learning
on-the-fly, even though they were not deliberately taught these skills.
This rapid and unpredictable evolution of AI capabilities presents a
significant challenge for preventing accidents. After all, it is
difficult to control something if we don’t even know what it can do or
how far it may exceed our expectations.</p>
<p><strong>It often takes years to discover severe flaws or
risks.</strong> History is replete with examples of substances or
technologies initially thought safe, only for their unintended flaws or
risks to be discovered years, if not decades, later. For example, lead
was widely used in products like paint and gasoline until its neurotoxic
effects came to light <span class="citation"
data-cites="Lidsky2003LeadNI">[9]</span>. Asbestos, once hailed for its
heat resistance and strength, was later linked to serious health issues,
such as lung cancer and mesothelioma <span class="citation"
data-cites="Mossman1990AsbestosSD">[10]</span>. The “Radium Girls”
suffered grave health consequences from radium exposure, a material they
were told was safe to put in their mouths <span class="citation"
data-cites="moore2017radium">[11]</span>. Tobacco, initially marketed as
a harmless pastime, was found to be a primary cause of lung cancer and
other health problems <span class="citation"
data-cites="Hecht1999TobaccoSC">[12]</span>. CFCs, once considered
harmless and used to manufacture aerosol sprays and refrigerants, were
found to deplete the ozone layer <span class="citation"
data-cites="Molina1974StratosphericSF">[13]</span>. Thalidomide, a drug
intended to alleviate morning sickness in pregnant women, led to severe
birth defects <span class="citation"
data-cites="Kim2011ThalidomideTT">[14]</span>. And more recently, the
proliferation of social media has been linked to an increase in
depression and anxiety, especially among young people <span
class="citation" data-cites="Keles2019ASR">[15]</span>.<p/>
This emphasizes the importance of not only conducting expert testing but
also implementing slow rollouts of technologies, allowing the test of
time to reveal and address potential flaws before they impact a larger
population. Even in technologies adhering to rigorous safety and
security standards, undiscovered vulnerabilities may persist, as
demonstrated by the Heartbleed bug—a serious vulnerability in the
popular OpenSSL cryptographic software library that remained undetected
for years before its eventual discovery <span class="citation"
data-cites="Durumeric2014TheMO">[16]</span>.<p/>
Furthermore, even state-of-the-art AI systems, which appear to have
solved problems comprehensively, may harbor unexpected failure modes
that can take years to uncover. For instance, while AlphaGo’s
groundbreaking success led many to believe that AIs had conquered the
game of Go, a subsequent adversarial attack on another highly advanced
Go-playing AI, KataGo, exposed a previously unknown flaw <span
class="citation" data-cites="Wang2022AdversarialPB">[17]</span>. This
vulnerability enabled human amateur players to consistently defeat the
AI, despite its significant advantage over human competitors who are
unaware of the flaw. More broadly, this example highlights that we must
remain vigilant when dealing with AI systems, as seemingly airtight
solutions may still contain undiscovered issues. In conclusion,
accidents are unpredictable and hard to avoid, and understanding and
managing potential risks requires a combination of proactive measures,
slow technology rollouts, and the invaluable wisdom gained through
steady time-testing.</p>
<h2
id="organizational-factors-can-reduce-the-chances-of-catastrophe">1.4.2 Organizational
Factors Can Reduce the Chances of Catastrophe</h2>
<p>Some organizations successfully avoid catastrophes while operating
complex and hazardous systems such as nuclear reactors, aircraft
carriers, and air traffic control systems <span class="citation"
data-cites="Laporte1991WorkingIP Dietterich2018RobustAI">[18],
[19]</span>. These organizations recognize that focusing solely on the
hazards of the technology involved is insufficient; consideration must
also be given to organizational factors that can contribute to
accidents, including human factors, organizational procedures, and
structure. These are especially important in the case of AI, where the
underlying technology is not highly reliable and remains poorly
understood.</p>
<p><strong>Human factors such as safety culture are critical for
avoiding AI catastrophes.</strong> One of the most important human
factors for preventing catastrophes is safety culture <span
class="citation" data-cites="leveson2016engineering manheim">[20],
[21]</span>. Developing a strong safety culture involves not only rules
and procedures, but also the internalization of these practices by all
members of an organization. A strong safety culture means that members
of an organization view safety as a key objective rather than a
constraint on their work. Organizations with strong safety cultures
often exhibit traits such as leadership commitment to safety, heightened
accountability where all individuals take personal responsibility for
safety, and a culture of open communication in which potential risks and
issues can be freely discussed without fear of retribution <span
class="citation" data-cites="national2014lessons">[22]</span>.
Organizations must also take measures to avoid alarm fatigue, whereby
individuals become desensitized to safety concerns because of the
frequency of potential failures. The Challenger Space Shuttle disaster
demonstrated the dire consequences of ignoring these factors when a
launch culture characterized by maintaining the pace of launches
overtook safety considerations. Even in the absence of competitive
pressure, the mission proceeded despite evidence of potentially fatal
flaws, ultimately leading to the tragic accident <span class="citation"
data-cites="vaughan1996challenger">[23]</span>.<p/>
Even in the most safety-critical contexts, safety culture often falls
short of the ideal in practice. Take, for example, Bruce Blair, a former nuclear launch
officer and senior fellow at the Brookings Institution. He once
disclosed that before 1977, the US Air Force had astonishingly set the
codes used to unlock intercontinental ballistic missiles to “00000000”
<span class="citation" data-cites="lamothe_air_2014">[24]</span>. Here,
safety mechanisms such as locks can be rendered virtually useless by
human factors.<p/>
A more dramatic example illustrates how researchers sometimes accept a
non-negligible chance of causing extinction. Prior to the first nuclear
weapon test, an eminent Manhattan Project scientist calculated the bomb
could cause an existential catastrophe: the explosion might ignite the
atmosphere and cover the Earth in flames. Although Oppenheimer believed
the calculations were probably incorrect, he remained deeply concerned,
and the team continued to scrutinize and debate the calculations right
until the day of the detonation <span class="citation"
data-cites="ord2020precipice">[25]</span>. Such instances underscore the
need for a robust safety culture.</p>
<p><strong>A questioning attitude can help uncover potential
flaws.</strong> Unexpected system behavior can create opportunities for
accidents or exploitation. To counter this, organizations can foster a
questioning attitude, where individuals continuously challenge current
conditions and activities to identify discrepancies that might lead to
errors or inappropriate actions <span class="citation"
data-cites="NRC2011FR">[26]</span>. This approach helps to encourage
diversity of thought and intellectual curiosity, thus preventing
potential pitfalls that arise from uniformity of thought and
assumptions. The Chernobyl nuclear disaster illustrates the importance
of a questioning attitude, as the safety measures in place failed to
address the reactor design flaws and ill-prepared operating procedures.
A questioning attitude toward the safety of the reactor during a test
operation might have prevented the explosion that resulted in the deaths and
illnesses of countless people.</p>
<p><strong>A security mindset is crucial for avoiding worst-case
scenarios.</strong> A security mindset, widely valued among computer
security professionals, is also applicable to organizations developing
AIs. It goes beyond a questioning attitude by adopting the perspective
of an attacker and by considering worst-case, not just average-case,
scenarios. This mindset requires vigilance in identifying
vulnerabilities that may otherwise go unnoticed and involves considering
how systems might be deliberately made to fail, rather than only
focusing on making them work. It reminds us not to assume a system is
safe simply because no potential hazards come to mind after a brief
brainstorming session. Cultivating and applying a security mindset
demands time and serious effort, as failure modes can often be
surprising and unintuitive. Furthermore, the security mindset emphasizes
the importance of being attentive to seemingly benign issues or
“harmless errors,” which can lead to catastrophic outcomes either due to
clever adversaries or correlated failures <span class="citation"
data-cites="schneier2008security">[27]</span>. This awareness of
potential threats aligns with Murphy’s law—“Anything that can go wrong
will go wrong”—recognizing that this can be a reality due to adversaries
and unforeseen events.</p>
<p><strong>Organizations with a strong safety culture can successfully
avoid catastrophes.</strong> High Reliability Organizations (HROs) are
organizations that consistently maintain a heightened level of safety
and reliability in complex, high-risk environments <span
class="citation" data-cites="Laporte1991WorkingIP">[18]</span>. A key
characteristic of HROs is their preoccupation with failure, which
requires considering worst-case scenarios and potential risks, even if
they seem unlikely. These organizations are acutely aware that new,
previously unobserved failure modes may exist, and they diligently study
all known failures, anomalies, and near misses to learn from them. HROs
encourage reporting all mistakes and anomalies to maintain vigilance in
uncovering problems. They engage in regular horizon scanning to identify
potential risk scenarios and assess their likelihood before they occur.
By practicing surprise management, HROs develop the skills needed to
respond quickly and effectively when unexpected situations arise,
further enhancing an organization’s ability to prevent catastrophes.
This combination of critical thinking, preparedness planning, and
continuous learning could help organizations to be better equipped to
address potential AI catastrophes. However, the practices of HROs are
not a panacea. It is crucial for organizations to evolve their safety
practices to effectively address the novel risks posed by AI accidents
above and beyond HRO best practices.</p>
<p><strong>Most AI researchers do not understand how to reduce overall
risk from AIs.</strong> In most organizations building cutting-edge AI
systems, there is often a limited understanding of what constitutes
technical safety research. This is understandable because an AI’s safety
and intelligence are intertwined, and intelligence can help or harm
safety. More intelligent AI systems could be more reliable and avoid
failures, but they could also pose heightened risks of malicious use and
loss of control. General capabilities improvements can improve aspects
of safety, but they can also hasten the onset of existential risks.
Intelligence is a double-edged sword <span class="citation"
data-cites="Hendrycks2022XRiskAF">[28]</span>.<p/>
Interventions specifically designed to improve safety may also
accidentally increase overall risks. For example, a common practice in
organizations building advanced AIs is to fine-tune them to satisfy user
preferences. This makes the AIs less prone to generating toxic language,
which is a common safety metric. However, users also tend to prefer
smarter assistants, so this process also improves the general
capabilities of AIs, such as their ability to classify, estimate,
reason, plan, write code, and so on. These more powerful AIs are indeed
more helpful to users, but also far more dangerous. Thus, it is not
enough to perform AI research that helps improve a safety metric or
achieve a specific safety goal—AI safety research needs to improve
safety <em>relative</em> to general capabilities.</p>
<p><strong>Empirical measurement of both safety and capabilities is
needed to establish that a safety intervention reduces overall AI
risk.</strong> Improving a facet of an AI’s safety often does
<em>not</em> reduce overall risk, as general capabilities advances can
often improve specific safety metrics. To reduce overall risk, a safety
metric needs to be improved relative to general capabilities. Both of
these quantities need to be empirically measured and contrasted.
Currently, most organizations proceed by gut feeling, appeals to
authority, and intuition to determine whether a safety intervention
would reduce overall risk. By objectively evaluating the effects of
interventions on safety metrics and capabilities metrics together,
organizations can better understand whether they are making progress on
safety relative to general capabilities.<p/>
Fortunately, safety and general capabilities are not identical. More
intelligent AIs may be more knowledgeable, clever, rigorous, and fast,
but this does not necessarily make them more just, power-averse, or
honest—an intelligent AI is not necessarily a beneficial AI. Several
research areas mentioned throughout this document improve safety
relative to general capabilities. For example, improving methods to
detect dangerous or undesirable behavior hidden inside AI systems does not
improve their general capabilities, such as the ability to code, but it
can greatly improve safety.</p>
<p>Research that empirically demonstrates an improvement of safety
relative to capabilities can reduce overall risk and help avoid
inadvertently accelerating AI development, fueling competitive
pressures, or hastening the onset of existential risks.</p>
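<p>One way to picture measuring safety <em>relative</em> to general
capabilities is to ask how much of a safety score is already explained by
capability. The following is a minimal sketch, not an established
evaluation method: the scores are invented for illustration, and the
helper names <code>fit_line</code> and <code>safety_residual</code> are
hypothetical. It fits a straight line to baseline models' capability and
safety scores, then reports how far a new model sits above or below that
trend.</p>

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def safety_residual(capability: float, safety: float,
                    slope: float, intercept: float) -> float:
    """Safety score minus the score predicted from capability alone."""
    return safety - (slope * capability + intercept)


# Hypothetical benchmark scores for baseline models (illustrative only).
baseline_capability = [40.0, 50.0, 60.0, 70.0]
baseline_safety = [30.0, 35.0, 40.0, 45.0]
slope, intercept = fit_line(baseline_capability, baseline_safety)

# Model A: safety score rose, but only as much as its capability gain predicts.
residual_a = safety_residual(80.0, 50.0, slope, intercept)
# Model B: safety score rose well beyond what its capability level predicts.
residual_b = safety_residual(60.0, 48.0, slope, intercept)

print(f"trend: safety = {slope:.2f} * capability + {intercept:.2f}")
print(f"model A residual: {residual_a:+.2f}")
print(f"model B residual: {residual_b:+.2f}")
```

<p>Under these assumptions, a residual near zero suggests that an apparent
safety gain is mostly a by-product of capability gains, while a clearly
positive residual suggests improvement in safety relative to capabilities.
A linear trend is itself an assumption; real evaluations may need a
different functional form and many more data points.</p>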
<figure id="fig:swiss_cheese">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/swiss_cheese.png" class="tb-img-full"/>
<p class="tb-caption">Figure 1.2 The Swiss cheese model shows how technical factors can
improve organizational safety. Multiple layers of defense compensate for
each other’s individual weaknesses, leading to a low overall level of
risk.</p>
</figure>
<p><strong>Safetywashing can undermine genuine efforts to improve AI
safety.</strong> Organizations should be wary of “safetywashing”—the act
of overstating or misrepresenting one’s commitment to safety by
exaggerating the effectiveness of “safety” procedures, technical
methods, evaluations, and so forth. This phenomenon takes on various
forms and can contribute to a lack of meaningful progress in safety
research. For example, an organization may publicize its dedication to
safety while having a minimal number of researchers working on projects
that truly improve safety.<p/>
Misrepresenting capabilities developments as safety improvements is
another way in which safetywashing can manifest. For example, methods
that improve the reasoning capabilities of AI systems could be
advertised as improving their adherence to human values—since humans
might prefer the reasoning to be correct—but would mainly serve to
enhance general capabilities. By framing these advancements as
safety-oriented, organizations may mislead others into believing they
are making substantial progress in reducing AI risks when in reality,
they are not. It is crucial for organizations to accurately represent
their research to promote genuine safety and avoid exacerbating risks
through safetywashing practices.</p>
<p><strong>In addition to human factors, safe design principles can
greatly affect organizational safety.</strong> One example of a safe
design principle in organizational safety is the Swiss cheese model (as
shown in Figure 1.2), which is applicable in various domains, including AI. The
Swiss cheese model employs a multilayered approach to enhance the
overall safety of AI systems. This “defense in depth” strategy involves
layering diverse safety measures with different strengths and weaknesses
to create a robust safety system. Some of the layers that can be
integrated into this model include safety culture, red teaming, anomaly
detection, information security, and transparency. For example, red
teaming assesses system vulnerabilities and failure modes, while anomaly
detection works to identify unexpected or unusual system behavior and
usage patterns. Transparency ensures that the inner workings of AI
systems are understandable and accessible, fostering trust and enabling
more effective oversight. By leveraging these and other safety measures,
the Swiss cheese model aims to create a comprehensive safety system
where the strengths of one layer compensate for the weaknesses of
another. With this model, safety is not achieved with a monolithic
airtight solution, but rather with a variety of safety measures.<p/>
In summary, weak organizational safety creates many sources of risk. For
AI developers with weak organizational safety, safety is merely a matter
of box-ticking. They do not develop a good understanding of risks from
AI and may safetywash unrelated research. Their norms might be inherited
from academia (“publish or perish”) or startups (“move fast and break
things”), and their hires often do not care about safety. These norms
are hard to change once they have inertia, and need to be addressed with
proactive interventions.</p>
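<p>The defense-in-depth idea behind the Swiss cheese model can be made
concrete with a little arithmetic. This is a rough sketch under strong
assumptions: the per-layer miss rates are invented, the layers are first
treated as failing independently, and the common-cause term is a simple
hypothetical mixture model rather than anything from the text. If each
layer independently misses a hazard with some probability, the chance that
a hazard slips through every layer is the product of those
probabilities.</p>

```python
import math
from typing import Dict


def breach_probability(miss_rates: Dict[str, float]) -> float:
    """Probability a hazard passes every layer, assuming independent layers."""
    return math.prod(miss_rates.values())


def breach_probability_with_common_cause(miss_rates: Dict[str, float],
                                         common_cause: float) -> float:
    """Hypothetical mixture model for correlated failures.

    With probability `common_cause`, one shared failure defeats all layers
    at once; otherwise the layers fail independently.
    """
    return common_cause + (1.0 - common_cause) * breach_probability(miss_rates)


# Invented per-layer miss rates, for illustration only.
layers = {
    "safety culture": 0.3,
    "red teaming": 0.2,
    "anomaly detection": 0.1,
    "information security": 0.2,
    "transparency": 0.5,
}

independent = breach_probability(layers)
correlated = breach_probability_with_common_cause(layers, common_cause=0.01)

print(f"weakest single layer miss rate: {max(layers.values()):.2f}")
print(f"all layers, independent:        {independent:.6f}")
print(f"all layers, 1% common cause:    {correlated:.6f}")
```

<p>Even though every individual layer in this sketch is leaky, the
combined miss rate is far lower than that of any single layer. The second
calculation shows why diversity among layers matters: once a shared
failure can defeat all layers together, that shared failure dominates the
overall risk.</p>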
<br>
<div class="storybox">
<legend class="storyboxlegend">
<span><b>Story: Weak Safety Culture</b></span>
</legend>
An AI company is considering whether to train a new model. The company’s Chief Risk Officer (CRO),
hired only to comply with regulation, points out that the previous AI
system developed by the company demonstrates some concerning
capabilities for hacking. The CRO says that while the company’s approach
to preventing misuse is promising, it isn’t robust enough to be used for
much more capable AIs. The CRO warns that based on limited evaluation,
the next AI system could make it much easier for malicious actors to
hack into critical systems. None of the other company executives are
concerned, and say the company’s procedures to prevent malicious use
work well enough. One mentions that their competitors have done much
less, so whatever effort they make on this front is already going above
and beyond. Another points out that research on these safeguards is
ongoing and will be improved by the time the model is released.
Outnumbered, the CRO is persuaded to reluctantly sign off on the
plan.<p/>
A few months after the company releases the model, news breaks that a
hacker has been arrested for using the AI system to try to breach the
network of a large bank. The hack was unsuccessful, but the hacker had
gotten further than any other hacker had before, despite being
relatively inexperienced. The company quickly updates the model to avoid
providing the particular kind of assistance that the hacker used, but
makes no fundamental improvements.<p/>
Several months later, the company is deciding whether to train an even
larger system. The CRO says that the company’s procedures have clearly
been insufficient to prevent malicious actors from eliciting dangerous
capabilities from its models, and the company needs more than a band-aid
solution. The other executives say that, on the contrary, the hacker was
unsuccessful and the problem was fixed soon afterwards. One says that
some problems just can’t be foreseen in enough detail to fix prior to
deployment. The CRO agrees, but says that ongoing research would enable
more improvements if only the next model could be delayed. The CEO
retorts, “That’s what you said the last time, and it turned out to be
fine. I’m sure it will work out, just like last time.”<p/>
After the meeting, the CRO decides to resign, but doesn’t speak out
against the company, as all employees have had to sign a
non-disparagement agreement. The public has no idea that concerns have
been raised about the company’s choices, and the CRO is replaced with a
new, more agreeable CRO who quickly signs off on the company’s
plans.<p/>
The company goes through with training, testing, and deploying its most
capable model ever, using its existing procedures to prevent malicious
use. A month later, revelations emerge that terrorists have managed to
use the system to break into government systems and steal nuclear and
biological secrets, despite the safeguards the company put in place. The
breach is detected, but by then it is too late: the dangerous
information has already proliferated.</p>
</div>
<br>
<br>
<h3>References</h3>
<div id="refs" class="references csl-bib-body" data-entry-spacing="0"
role="list">
<div id="ref-uri_35_2021" class="csl-entry" role="listitem">
<div class="csl-left-margin">[1] J.
Uri, <span>“35 <span>Years</span> <span>Ago</span>:
<span>Remembering</span> <span>Challenger</span> and <span>Her</span>
<span>Crew</span>,”</span> <em>NASA</em>. Jan. 2021.</div>
</div>
<div id="ref-iaea1992chernobyl" class="csl-entry" role="listitem">
<div class="csl-left-margin">[2] </div><div
class="csl-right-inline"><span>“The
<span>Chernobyl</span> accident: Updating of
<span>INSAG-1</span>,”</span> <span>International Atomic Energy
Agency</span>, Vienna, Austria, Technical Report INSAG-7, 1992.</div>
</div>
<div id="ref-Meselson1994TheSA" class="csl-entry" role="listitem">
<div class="csl-left-margin">[3] M.
Meselson <em>et al.</em>, <span>“The Sverdlovsk anthrax outbreak of
1979.”</span> <em>Science</em>, vol. 266, no. 5188, pp. 1202–8, 1994.</div>
</div>
<div id="ref-ziegler2019fine" class="csl-entry" role="listitem">
<div class="csl-left-margin">[4] D.
M. Ziegler <em>et al.</em>, <span>“Fine-tuning language models from
human preferences,”</span> <em>arXiv preprint arXiv:1909.08593</em>,
2019.</div>
</div>
<div id="ref-perrow1984normal" class="csl-entry" role="listitem">
<div class="csl-left-margin">[5] C.
Perrow, <em>Normal accidents: Living with high-risk technologies</em>.
Princeton, NJ: Princeton University Press, 1984.</div>
</div>
<div id="ref-Rogovin1980ThreeMI" class="csl-entry" role="listitem">
<div class="csl-left-margin">[6] M.
Rogovin and G. T. Frampton Jr., <span>“Three <span>Mile</span>
<span>Island</span>: A report to the commissioners and to the public.
<span>Volume</span> <span>I</span>,”</span> Nuclear Regulatory
Commission, Washington, DC (United States). Three Mile Island Special
Inquiry Group, NUREG/CR-1250(Vol.1), Jan. 1980.</div>
</div>
<div id="ref-rhodes1986making" class="csl-entry" role="listitem">
<div class="csl-left-margin">[7] R.
Rhodes, <em>The making of the atomic bomb</em>. New York: Simon &amp;
Schuster, 1986.</div>
</div>
<div id="ref-Bubeck2023SparksOA" class="csl-entry" role="listitem">
<div class="csl-left-margin">[8] S.
Bubeck <em>et al.</em>, <span>“Sparks of artificial general
intelligence: Early experiments with GPT-4,”</span> <em>arXiv preprint
arXiv:2303.12712</em>, 2023.</div>
</div>
<div id="ref-Lidsky2003LeadNI" class="csl-entry" role="listitem">
<div class="csl-left-margin">[9] T.
I. Lidsky and J. S. Schneider, <span>“Lead neurotoxicity in children:
Basic mechanisms and clinical correlates.”</span> <em>Brain: A Journal
of Neurology</em>, vol. 126, pt. 1, pp. 5–19, 2003.</div>
</div>
<div id="ref-Mossman1990AsbestosSD" class="csl-entry" role="listitem">
<div class="csl-left-margin">[10] B.
T. Mossman, J. Y. Bignon, M. Corn, A. Seaton, and J. B. L. Gee,
<span>“Asbestos: Scientific developments and implications for public
policy.”</span> <em>Science</em>, vol. 247, no. 4940, pp. 294–301,
1990.</div>
</div>
<div id="ref-moore2017radium" class="csl-entry" role="listitem">
<div class="csl-left-margin">[11] K.
Moore, <em>The radium girls: The dark story of America’s shining
women</em>. Naperville, IL: Sourcebooks, 2017.</div>
</div>
<div id="ref-Hecht1999TobaccoSC" class="csl-entry" role="listitem">
<div class="csl-left-margin">[12] S.
S. Hecht, <span>“Tobacco smoke carcinogens and lung cancer.”</span>
<em>Journal of the National Cancer Institute</em>, vol. 91, no. 14, pp.
1194–210, 1999.</div>
</div>
<div id="ref-Molina1974StratosphericSF" class="csl-entry"
role="listitem">
<div class="csl-left-margin">[13] M.
J. Molina and F. S. Rowland, <span>“Stratospheric sink for
chlorofluoromethanes: Chlorine atom-catalysed destruction of
ozone,”</span> <em>Nature</em>, vol. 249, pp. 810–812, 1974.</div>
</div>
<div id="ref-Kim2011ThalidomideTT" class="csl-entry" role="listitem">
<div class="csl-left-margin">[14] J.
H. Kim and A. R. Scialli, <span>“Thalidomide: The tragedy of birth
defects and the effective treatment of disease.”</span>
<em>Toxicological Sciences: An Official Journal of the Society of
Toxicology</em>, vol. 122, no. 1, pp. 1–6, 2011.</div>
</div>
<div id="ref-Keles2019ASR" class="csl-entry" role="listitem">
<div class="csl-left-margin">[15] B.
Keles, N. McCrae, and A. Grealish, <span>“A systematic review: The
influence of social media on depression, anxiety and psychological
distress in adolescents,”</span> <em>International Journal of
Adolescence and Youth</em>, vol. 25, pp. 79–93, 2019.</div>
</div>
<div id="ref-Durumeric2014TheMO" class="csl-entry" role="listitem">
<div class="csl-left-margin">[16] Z.
Durumeric <em>et al.</em>, <span>“The matter of Heartbleed,”</span>
<em>Proceedings of the 2014 Conference on Internet Measurement
Conference</em>, 2014.</div>
</div>
<div id="ref-Wang2022AdversarialPB" class="csl-entry" role="listitem">
<div class="csl-left-margin">[17] T.
T. Wang <em>et al.</em>, <span>“Adversarial policies beat
professional-level Go AIs,”</span> <em>arXiv preprint arXiv:2211.00241</em>,
2022.</div>
</div>
<div id="ref-Laporte1991WorkingIP" class="csl-entry" role="listitem">
<div class="csl-left-margin">[18] T.
R. Laporte and P. M. Consolini, <span>“Working in practice but not in
theory: Theoretical challenges of <span>‘high-reliability
organizations’</span>,”</span> <em>Journal of Public Administration
Research and Theory</em>, vol. 1, pp. 19–48, 1991.</div>
</div>
<div id="ref-Dietterich2018RobustAI" class="csl-entry" role="listitem">
<div class="csl-left-margin">[19] T.
G. Dietterich, <span>“Robust artificial intelligence and robust human
organizations,”</span> <em>Frontiers of Computer Science</em>, vol. 13,
pp. 1–3, 2018.</div>
</div>
<div id="ref-leveson2016engineering" class="csl-entry" role="listitem">
<div class="csl-left-margin">[20] N.
G. Leveson, <em>Engineering a safer world: Systems thinking applied to
safety</em>. The MIT Press, 2016.</div>
</div>
<div id="ref-manheim" class="csl-entry" role="listitem">
<div class="csl-left-margin">[21] D.
Manheim, <span>“Building a culture of safety for AI: Perspectives and
challenges,”</span> <em>SSRN</em>. 2023.</div>
</div>
<div id="ref-national2014lessons" class="csl-entry" role="listitem">
<div class="csl-left-margin">[22] National
Research Council, Division on Earth and Life Studies, Nuclear and
Radiation Studies Board, and Committee on Lessons Learned from the
Fukushima Nuclear Accident for Improving Safety and Security of U.S.
Nuclear Plants, <em>Lessons
<span>Learned</span> from the <span>Fukushima</span>
<span>Nuclear</span> <span>Accident</span> for <span>Improving</span>
<span>Safety</span> of <span>U</span>.<span>S</span>.
<span>Nuclear</span> <span>Plants</span></em>. Washington, D.C.:
National Academies Press, 2014.</div>
</div>
<div id="ref-vaughan1996challenger" class="csl-entry" role="listitem">
<div class="csl-left-margin">[23] D.
Vaughan, <em>The Challenger launch decision: Risky technology, culture,
and deviance at NASA</em>. Chicago, IL: University of Chicago Press,
1996.</div>
</div>
<div id="ref-lamothe_air_2014" class="csl-entry" role="listitem">
<div class="csl-left-margin">[24] D.
Lamothe, <span>“Air <span>Force</span> <span>Swears</span>:
<span>Our</span> <span>Nuke</span> <span>Launch</span> <span>Code</span>
<span>Was</span> <span>Never</span> ’00000000’,”</span> <em>Foreign
Policy</em>. Jan. 2014.</div>
</div>
<div id="ref-ord2020precipice" class="csl-entry" role="listitem">
<div class="csl-left-margin">[25] T.
Ord, <em>The precipice: Existential risk and the future of
humanity</em>. Hachette Books, 2020.</div>
</div>
<div id="ref-NRC2011FR" class="csl-entry" role="listitem">
<div class="csl-left-margin">[26] U.S.
Nuclear Regulatory Commission, <span>“Final safety culture policy
statement,”</span> <em>Federal Register</em>, vol. 76, p. 34773, 2011.</div>
</div>
<div id="ref-schneier2008security" class="csl-entry" role="listitem">
<div class="csl-left-margin">[27] B.
Schneier, <span>“Inside the twisted mind of the security
professional,”</span> <em>Wired</em>, 2008.</div>
</div>
<div id="ref-Hendrycks2022XRiskAF" class="csl-entry" role="listitem">
<div class="csl-left-margin">[28] D.
Hendrycks and M. Mazeika, <span>“X-risk analysis for AI
research,”</span> <em>arXiv preprint arXiv:2206.05862</em>, 2022.</div>
</div>
</div>
</body>
</html>