OpenAI releases a collection of its alignment research work
https://zhuanlan.zhihu.com/p/622188645
OpenAI predicts that superintelligent AI will arrive within seven years and announces that "20% of its compute will go toward the loss-of-control problem": what is worth paying attention to?
https://www.zhihu.com/question/610639130/answers/updated
https://openai.com/blog/introducing-superalignment
OpenAI issues a global recruiting call to build a new "Superalignment" team, vowing to solve the steering and control of superintelligent AI within four years
https://zhuanlan.zhihu.com/p/641817822
Using AI to align AI? The Superalignment team lead details OpenAI's four-year plan to align superintelligence
https://zhuanlan.zhihu.com/p/649441164
GPT-4 does six months of work in a day: are human moderators out of a job? OpenAI's newly released major upgrade could replace many human content reviewers
https://zhuanlan.zhihu.com/p/650387533
Using GPT-4 for content moderation
https://openai.com/blog/using-gpt-4-for-content-moderation
OpenAI reveals a new GPT-4 capability that "completes six months of content-moderation work in one day": what effects will it have?
https://www.zhihu.com/question/617524795
Ten Levels of AI Alignment Difficulty
https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty
A long-form overview of large language model alignment (deceptive alignment, scalable oversight, mechanistic interpretability, instrumental goal convergence)
https://zhuanlan.zhihu.com/p/643161870
Overview of human-machine alignment | 13. Core concepts of the alignment problem
https://zhuanlan.zhihu.com/p/630894776
Anthropic's core views on AI safety: when, why, what, and how
https://zhuanlan.zhihu.com/p/626097959
Quintic AI's illustrated breakdown of ChatGPT's various failure cases
https://zhuanlan.zhihu.com/p/621986033
Weak-to-strong generalization
https://openai.com/research/weak-to-strong-generalization
A close reading and summary of OpenAI's superalignment paper "Weak-to-Strong Generalization"
https://zhuanlan.zhihu.com/p/672715535
Superalignment Fast Grants
https://openai.com/blog/superalignment-fast-grants
https://openai.notion.site/Research-directions-0df8dd8136004615b0936bf48eb6aeb8
To study the interpretability of deep learning, which aspects should one start from?
https://www.zhihu.com/question/320688440
Precision poisoning of an open-source Hugging Face model! A "lobotomized" LLM becomes PoisonGPT, brainwashing six billion people with false facts
https://zhuanlan.zhihu.com/p/642616786
Overview of Model Editing
https://zhuanlan.zhihu.com/p/609177437
Knowledge Neurons in Pretrained Transformers: Peking University and Microsoft use integrated gradients to extract "knowledge neurons" from Transformer FFN layers
https://zhuanlan.zhihu.com/p/611481317
Locating and Editing Factual Associations in GPT
https://blog.csdn.net/qq_28385535/article/details/128312436
https://mp.weixin.qq.com/s?__biz=MzI4MDYzNzg4Mw==&mid=2247554176&idx=3&sn=08759b617e3cf11f9fdedab3a97346e3&chksm=ebb72c54dcc0a54281cfef69a230f3c0f9e9e576517912b927efb152547c236a92e432c3eb10&scene=27
https://arxiv.org/abs/2202.05262
Transformer Feed-Forward Layers Are Key-Value Memories
https://zhuanlan.zhihu.com/p/611278136
https://arxiv.org/abs/2012.14913
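The paper above reads a Transformer's FFN as a key-value memory: rows of the first linear layer act as keys matched against the input, and rows of the second layer are values mixed by those match scores. A minimal numpy sketch of that view (random toy weights, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

W_key = rng.normal(size=(d_ff, d_model))    # each row: one "key" pattern
W_value = rng.normal(size=(d_ff, d_model))  # each row: the "value" it writes

def ffn(x):
    scores = np.maximum(W_key @ x, 0.0)     # ReLU match scores ("memory coefficients")
    return scores @ W_value                 # weighted sum of value rows

x = rng.normal(size=d_model)
out = ffn(x)

# The output is literally a linear combination of the value rows:
scores = np.maximum(W_key @ x, 0.0)
manual = sum(s * v for s, v in zip(scores, W_value))
assert np.allclose(out, manual)
```

Under this reading, later interpretability work (logit lens, LM-Debugger, concept promotion in vocabulary space) inspects which value rows a given input activates.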
Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge
https://arxiv.org/abs/2305.01651
MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
https://arxiv.org/abs/2305.14795
Decouple knowledge from parameters for plug-and-play language modeling
https://arxiv.org/abs/2305.11564
https://github.com/hannibal046/pluglm
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
https://arxiv.org/abs/2304.14767
Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering
https://arxiv.org/abs/2204.04581
Inseq: An Interpretability Toolkit for Sequence Generation Models
https://arxiv.org/abs/2302.13942
Explaining How Transformers Use Context to Build Predictions
https://arxiv.org/abs/2305.12535
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
https://arxiv.org/abs/2203.14680
Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT
https://arxiv.org/abs/2305.13417
https://github.com/shacharkz/visualizing-the-information-flow-of-gpt
Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute
https://arxiv.org/abs/2301.10448
RARR: Researching and Revising What Language Models Say, Using Language Models
https://arxiv.org/abs/2210.08726
Complex Claim Verification with Evidence Retrieved in the Wild
https://arxiv.org/abs/2305.11859
Using Natural Language Explanations to Rescale Human Judgments
https://arxiv.org/abs/2305.14770
When to Read Documents or QA History: On Unified and Selective Open-domain QA
https://arxiv.org/abs/2306.04176
Augmenting Self-attention with Persistent Memory
https://arxiv.org/abs/1907.01470
Editing Large Language Models: Problems, Methods, and Opportunities
https://arxiv.org/abs/2305.13172
What to do when a large model's knowledge is out of date? A Zhejiang University team explores model editing, a method for updating model parameters
https://www.php.cn/faq/552888.html
Eliciting Latent Predictions from Transformers with the Tuned Lens
https://arxiv.org/abs/2303.08112
Integrated gradients: a novel neural network visualization method
https://www.spaces.ac.cn/archives/7533
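Integrated gradients (the attribution method the knowledge-neurons paper above builds on) scales gradients along a straight path from a baseline to the input. A minimal self-contained sketch on a toy differentiable function (assumed setup, not tied to any framework):

```python
import numpy as np

def f(x):
    return x[0] * x[1] + x[2] ** 2

def grad_f(x):
    return np.array([x[1], x[0], 2 * x[2]])

def integrated_gradients(x, baseline, steps=200):
    # Riemann-sum approximation of
    # IG_i = (x_i - b_i) * ∫₀¹ ∂f/∂x_i(b + α(x − b)) dα
    alphas = (np.arange(steps) + 0.5) / steps   # midpoint rule
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([1.0, 2.0, 3.0])
b = np.zeros(3)
ig = integrated_gradients(x, b)
# Completeness axiom: attributions sum to f(x) - f(baseline).
assert abs(ig.sum() - (f(x) - f(b))) < 1e-6
```

The completeness check at the end is the property that makes the per-dimension attributions interpretable as shares of the output change.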
LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models
https://arxiv.org/abs/2204.12130
https://github.com/mega002/lm-debugger
interpreting GPT: the logit lens
https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
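The logit-lens post above decodes a model's intermediate residual-stream states through the unembedding matrix to see what it "would predict" at each layer. A toy illustration of just the mechanics, with random stand-in weights rather than a real model:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab, n_layers = 16, 50, 4

W_unembed = rng.normal(size=(d_model, vocab))
hidden = rng.normal(size=d_model)                     # residual stream after embedding

lens_top_tokens = []
for layer in range(n_layers):
    hidden = hidden + 0.1 * rng.normal(size=d_model)  # stand-in for one layer's update
    logits = hidden @ W_unembed                       # the "lens": decode early
    lens_top_tokens.append(int(logits.argmax()))

print(lens_top_tokens)  # how the interim top prediction shifts layer by layer
```

With a real model, the final LayerNorm is usually applied before the unembedding; the tuned-lens paper listed above replaces this raw projection with a learned per-layer affine probe.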
Finding Skill Neurons in Pre-trained Transformer-based Language Models
https://arxiv.org/abs/2211.07349
Emergent Modularity in Pre-trained Transformers
https://arxiv.org/abs/2305.18390
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
https://arxiv.org/abs/2305.08746
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
https://arxiv.org/abs/2301.04213
Augmenting Language Models with Long-Term Memory
https://arxiv.org/abs/2306.07174
https://zhuanlan.zhihu.com/p/639000130
https://github.com/Victorwz/LongMem
Enhancing a model's memory capability: Memorizing Transformers
https://zhuanlan.zhihu.com/p/651891213
https://arxiv.org/abs/2203.08913
Reading notes on "Focused Transformer: Contrastive Training for Context Scaling"
https://zhuanlan.zhihu.com/p/642869077
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
https://arxiv.org/abs/2309.05973
A parametric mirror of the world: why GPT can give rise to intelligence through next-token prediction
https://zhuanlan.zhihu.com/p/632795115
Towards Automated Circuit Discovery for Mechanistic Interpretability
https://arxiv.org/abs/2304.14997
Decomposing a large model's neurons! The Claude team's latest research takes off; commenters: the black box is opened
https://zhuanlan.zhihu.com/p/659898917
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
https://transformer-circuits.pub/2023/monosemantic-features/index.html
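The dictionary-learning approach in "Towards Monosemanticity" trains a sparse autoencoder with an overcomplete feature basis on model activations, so each activation decomposes into a few interpretable features. A forward-pass-only sketch with untrained random weights (illustrating the architecture and loss, not the trained result):

```python
import numpy as np

rng = np.random.default_rng(2)
d_act, d_feat = 16, 64                       # overcomplete: more features than dims

W_enc = rng.normal(size=(d_act, d_feat)) * 0.1
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_act)) * 0.1

def sae(x):
    feats = np.maximum(x @ W_enc + b_enc, 0.0)  # non-negative feature activations
    x_hat = feats @ W_dec                       # reconstruction from the dictionary
    return feats, x_hat

x = rng.normal(size=d_act)
feats, x_hat = sae(x)

# Training objective: reconstruction error plus an L1 penalty that drives sparsity.
loss = np.sum((x - x_hat) ** 2) + 1e-3 * np.sum(np.abs(feats))
```

The L1 term is what pushes most feature activations to zero, so that the few active dictionary rows can be inspected individually; the "Scaling Monosemanticity" entries below apply the same recipe to Claude 3 Sonnet.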
A quick survey of the past six months of RLHF work
https://zhuanlan.zhihu.com/p/640350234
Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading Comprehension Shortcut Triggers
https://arxiv.org/abs/2310.18360
With Ilya involved, OpenAI works on GPT-4 interpretability, extracting 16 million features and even showing what the model is "thinking"
https://zhuanlan.zhihu.com/p/702193432
Extracting Concepts from GPT-4
https://openai.com/index/extracting-concepts-from-gpt-4/
Do not use for illegal purposes! Excising Qwen's safety-censorship behavior: operating on an LLM to generate any content you want, applicable to all large models
https://zhuanlan.zhihu.com/p/704525000
Mapping the Mind of a Large Language Model
https://www.anthropic.com/research/mapping-mind-language-model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
https://transformer-circuits.pub/2024/scaling-monosemanticity/index
[Paper quick read] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
https://zhuanlan.zhihu.com/p/702204376
The black box of large language models is opened for the first time, revealing the internal features of Claude 3 Sonnet
https://baijiahao.baidu.com/s?id=1800177149644519317&wfr=spider&for=pc
Finding GPT-4’s mistakes with GPT-4
https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf
GPT-4 critiques GPT-4 to achieve "self-improvement"! Another major work from OpenAI's former Superalignment team is made public
https://zhuanlan.zhihu.com/p/705966422
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
https://arxiv.org/abs/2401.05566
Beware! Don't teach large models to deceive: research shows that once an AI goes bad, it is hard to correct
https://baijiahao.baidu.com/s?id=1789467200001404263&wfr=spider&for=pc
"AI has learned to deceive, humanity is finished"? After reading Anthropic's paper, I found that's not the story at all
https://baijiahao.baidu.com/s?id=1788780652063489525&wfr=spider&for=pc
Simple probes can catch sleeper agents
https://www.anthropic.com/research/probes-catch-sleeper-agents
OpenAI releases PVG: verifying a large model's outputs with a small model to tackle the "black box" problem
https://www.163.com/dy/article/J7C5B8JO0512B07B.html
https://openai.com/index/prover-verifier-games-improve-legibility/