OpenAI releases a collection of its alignment research work
https://zhuanlan.zhihu.com/p/622188645
OpenAI predicts that superintelligent AI will arrive within seven years and announces that "20% of its compute will go toward the loss-of-control problem": what is worth paying attention to?
https://www.zhihu.com/question/610639130/answers/updated
https://openai.com/blog/introducing-superalignment
OpenAI issues a global recruiting call to build a new "Superalignment" team, vowing to solve the steering and control of superintelligent AI within four years
https://zhuanlan.zhihu.com/p/641817822
Using AI to align AI? The Superalignment team lead details OpenAI's four-year plan to align superintelligence
https://zhuanlan.zhihu.com/p/649441164
GPT-4 does six months of work in a day: are human moderators out of a job? OpenAI's newly released major upgrade could replace many human content reviewers
https://zhuanlan.zhihu.com/p/650387533
Using GPT-4 for content moderation
https://openai.com/blog/using-gpt-4-for-content-moderation
OpenAI reveals a new GPT-4 capability that "completes six months of content-moderation work in one day": what effects will it have?
https://www.zhihu.com/question/617524795
Ten Levels of AI Alignment Difficulty
https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty
A long-form overview of large language model alignment (deceptive alignment, scalable oversight, mechanistic interpretability, instrumental goal convergence)
https://zhuanlan.zhihu.com/p/643161870
Overview of human-machine alignment | 13. Core concepts of the alignment problem
https://zhuanlan.zhihu.com/p/630894776
Anthropic's core views on AI safety: when, why, what, and how
https://zhuanlan.zhihu.com/p/626097959
Quintic AI's illustrated breakdown of ChatGPT's various failure cases
https://zhuanlan.zhihu.com/p/621986033
Weak-to-strong generalization
https://openai.com/research/weak-to-strong-generalization
A close reading and summary of OpenAI's superalignment paper "Weak-to-Strong Generalization"
https://zhuanlan.zhihu.com/p/672715535
Superalignment Fast Grants
https://openai.com/blog/superalignment-fast-grants
https://openai.notion.site/Research-directions-0df8dd8136004615b0936bf48eb6aeb8
To study the interpretability of deep learning, which aspects should one start from?
https://www.zhihu.com/question/320688440
Precision poisoning of an open-source Hugging Face model! A "lobotomized" LLM becomes PoisonGPT, brainwashing six billion people with false facts
https://zhuanlan.zhihu.com/p/642616786
Overview of Model Editing
https://zhuanlan.zhihu.com/p/609177437
Knowledge Neurons in Pretrained Transformers: Peking University and Microsoft use integrated gradients to extract "knowledge neurons" from Transformer FFN layers
https://zhuanlan.zhihu.com/p/611481317
Locating and Editing Factual Associations in GPT
https://blog.csdn.net/qq_28385535/article/details/128312436
https://mp.weixin.qq.com/s?__biz=MzI4MDYzNzg4Mw==&mid=2247554176&idx=3&sn=08759b617e3cf11f9fdedab3a97346e3&chksm=ebb72c54dcc0a54281cfef69a230f3c0f9e9e576517912b927efb152547c236a92e432c3eb10&scene=27
https://arxiv.org/abs/2202.05262
Transformer Feed-Forward Layers Are Key-Value Memories
https://zhuanlan.zhihu.com/p/611278136
https://arxiv.org/abs/2012.14913
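The paper above reads a Transformer's FFN as a key-value memory: rows of the first linear layer act as keys matched against the input, and rows of the second layer are values mixed by those match scores. A minimal numpy sketch of that view (random toy weights, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

W_key = rng.normal(size=(d_ff, d_model))    # each row: one "key" pattern
W_value = rng.normal(size=(d_ff, d_model))  # each row: the "value" it writes

def ffn(x):
    scores = np.maximum(W_key @ x, 0.0)     # ReLU match scores ("memory coefficients")
    return scores @ W_value                 # weighted sum of value rows

x = rng.normal(size=d_model)
out = ffn(x)

# The output is literally a linear combination of the value rows:
scores = np.maximum(W_key @ x, 0.0)
manual = sum(s * v for s, v in zip(scores, W_value))
assert np.allclose(out, manual)
```

Under this reading, later interpretability work (logit lens, LM-Debugger, concept promotion in vocabulary space) inspects which value rows a given input activates.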
Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge
https://arxiv.org/abs/2305.01651
MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
https://arxiv.org/abs/2305.14795
Decouple knowledge from parameters for plug-and-play language modeling
https://arxiv.org/abs/2305.11564
https://github.com/hannibal046/pluglm
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
https://arxiv.org/abs/2304.14767
Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering
https://arxiv.org/abs/2204.04581
Inseq: An Interpretability Toolkit for Sequence Generation Models
https://arxiv.org/abs/2302.13942
Explaining How Transformers Use Context to Build Predictions
https://arxiv.org/abs/2305.12535
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
https://arxiv.org/abs/2203.14680
Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT
https://arxiv.org/abs/2305.13417
https://github.com/shacharkz/visualizing-the-information-flow-of-gpt
Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute
https://arxiv.org/abs/2301.10448
RARR: Researching and Revising What Language Models Say, Using Language Models
https://arxiv.org/abs/2210.08726
Complex Claim Verification with Evidence Retrieved in the Wild
https://arxiv.org/abs/2305.11859
Using Natural Language Explanations to Rescale Human Judgments
https://arxiv.org/abs/2305.14770
When to Read Documents or QA History: On Unified and Selective Open-domain QA
https://arxiv.org/abs/2306.04176
Augmenting Self-attention with Persistent Memory
https://arxiv.org/abs/1907.01470
Editing Large Language Models: Problems, Methods, and Opportunities
https://arxiv.org/abs/2305.13172
What to do when a large model's knowledge is out of date? A Zhejiang University team explores model editing, a method for updating model parameters
https://www.php.cn/faq/552888.html
Eliciting Latent Predictions from Transformers with the Tuned Lens
https://arxiv.org/abs/2303.08112
Integrated gradients: a novel neural network visualization method
https://www.spaces.ac.cn/archives/7533
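Integrated gradients (the attribution method the knowledge-neurons paper above builds on) scales gradients along a straight path from a baseline to the input. A minimal self-contained sketch on a toy differentiable function (assumed setup, not tied to any framework):

```python
import numpy as np

def f(x):
    return x[0] * x[1] + x[2] ** 2

def grad_f(x):
    return np.array([x[1], x[0], 2 * x[2]])

def integrated_gradients(x, baseline, steps=200):
    # Riemann-sum approximation of
    # IG_i = (x_i - b_i) * ∫₀¹ ∂f/∂x_i(b + α(x − b)) dα
    alphas = (np.arange(steps) + 0.5) / steps   # midpoint rule
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([1.0, 2.0, 3.0])
b = np.zeros(3)
ig = integrated_gradients(x, b)
# Completeness axiom: attributions sum to f(x) - f(baseline).
assert abs(ig.sum() - (f(x) - f(b))) < 1e-6
```

The completeness check at the end is the property that makes the per-dimension attributions interpretable as shares of the output change.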
LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models
https://arxiv.org/abs/2204.12130
https://github.com/mega002/lm-debugger
interpreting GPT: the logit lens
https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
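The logit-lens post above decodes a model's intermediate residual-stream states through the unembedding matrix to see what it "would predict" at each layer. A toy illustration of just the mechanics, with random stand-in weights rather than a real model:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab, n_layers = 16, 50, 4

W_unembed = rng.normal(size=(d_model, vocab))
hidden = rng.normal(size=d_model)                     # residual stream after embedding

lens_top_tokens = []
for layer in range(n_layers):
    hidden = hidden + 0.1 * rng.normal(size=d_model)  # stand-in for one layer's update
    logits = hidden @ W_unembed                       # the "lens": decode early
    lens_top_tokens.append(int(logits.argmax()))

print(lens_top_tokens)  # how the interim top prediction shifts layer by layer
```

With a real model, the final LayerNorm is usually applied before the unembedding; the tuned-lens paper listed above replaces this raw projection with a learned per-layer affine probe.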
Finding Skill Neurons in Pre-trained Transformer-based Language Models
https://arxiv.org/abs/2211.07349
Emergent Modularity in Pre-trained Transformers
https://arxiv.org/abs/2305.18390
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
https://arxiv.org/abs/2305.08746
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
https://arxiv.org/abs/2301.04213
Augmenting Language Models with Long-Term Memory
https://arxiv.org/abs/2306.07174
https://zhuanlan.zhihu.com/p/639000130
https://github.com/Victorwz/LongMem
Enhancing a model's memory capability: Memorizing Transformers
https://zhuanlan.zhihu.com/p/651891213
https://arxiv.org/abs/2203.08913
Reading notes on "Focused Transformer: Contrastive Training for Context Scaling"
https://zhuanlan.zhihu.com/p/642869077
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
https://arxiv.org/abs/2309.05973
A parametric mirror of the world: why GPT can give rise to intelligence through next-token prediction
https://zhuanlan.zhihu.com/p/632795115
Towards Automated Circuit Discovery for Mechanistic Interpretability
https://arxiv.org/abs/2304.14997
Decomposing a large model's neurons! The Claude team's latest research takes off; commenters: the black box is opened
https://zhuanlan.zhihu.com/p/659898917
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
https://transformer-circuits.pub/2023/monosemantic-features/index.html
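The dictionary-learning approach in "Towards Monosemanticity" trains a sparse autoencoder with an overcomplete feature basis on model activations, so each activation decomposes into a few interpretable features. A forward-pass-only sketch with untrained random weights (illustrating the architecture and loss, not the trained result):

```python
import numpy as np

rng = np.random.default_rng(2)
d_act, d_feat = 16, 64                       # overcomplete: more features than dims

W_enc = rng.normal(size=(d_act, d_feat)) * 0.1
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_act)) * 0.1

def sae(x):
    feats = np.maximum(x @ W_enc + b_enc, 0.0)  # non-negative feature activations
    x_hat = feats @ W_dec                       # reconstruction from the dictionary
    return feats, x_hat

x = rng.normal(size=d_act)
feats, x_hat = sae(x)

# Training objective: reconstruction error plus an L1 penalty that drives sparsity.
loss = np.sum((x - x_hat) ** 2) + 1e-3 * np.sum(np.abs(feats))
```

The L1 term is what pushes most feature activations to zero, so that the few active dictionary rows can be inspected individually; the "Scaling Monosemanticity" entries below apply the same recipe to Claude 3 Sonnet.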
A quick survey of the past six months of RLHF work
https://zhuanlan.zhihu.com/p/640350234
Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading Comprehension Shortcut Triggers
https://arxiv.org/abs/2310.18360
With Ilya involved, OpenAI works on GPT-4 interpretability, extracting 16 million features and even showing what the model is "thinking"
https://zhuanlan.zhihu.com/p/702193432
Extracting Concepts from GPT-4
https://openai.com/index/extracting-concepts-from-gpt-4/
Do not use for illegal purposes! Excising Qwen's safety-censorship behavior: operating on an LLM to generate any content you want, applicable to all large models
https://zhuanlan.zhihu.com/p/704525000
Mapping the Mind of a Large Language Model
https://www.anthropic.com/research/mapping-mind-language-model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
https://transformer-circuits.pub/2024/scaling-monosemanticity/index
[Paper quick read] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
https://zhuanlan.zhihu.com/p/702204376
The black box of large language models is opened for the first time, revealing the internal features of Claude 3 Sonnet
https://baijiahao.baidu.com/s?id=1800177149644519317&wfr=spider&for=pc
Finding GPT-4’s mistakes with GPT-4
https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf
GPT-4 critiques GPT-4 to achieve "self-improvement"! Another major work from OpenAI's former Superalignment team is made public
https://zhuanlan.zhihu.com/p/705966422
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
https://arxiv.org/abs/2401.05566
Beware! Don't teach large models to deceive: research shows that once an AI goes bad, it is hard to correct
https://baijiahao.baidu.com/s?id=1789467200001404263&wfr=spider&for=pc
"AI has learned to deceive, humanity is finished"? After reading Anthropic's paper, I found that's not the story at all
https://baijiahao.baidu.com/s?id=1788780652063489525&wfr=spider&for=pc
Simple probes can catch sleeper agents
https://www.anthropic.com/research/probes-catch-sleeper-agents
OpenAI releases PVG: verifying a large model's outputs with a small model to tackle the "black box" problem
https://www.163.com/dy/article/J7C5B8JO0512B07B.html
https://openai.com/index/prover-verifier-games-improve-legibility/