🔥 Search Hot Tweets

Search and analyze hot tweets from KOL accounts list (list: https://x.com/i/lists/1961235697677017443) within 6 hours. Use SoPilot plugin to quickly comment and occupy the comment section.

Real-time Hot Tweet Analysis

9 10 11 12 13

Discord RSS

马东锡 NLP

@dongxi_nlp· 28.7K followers

「 LLM safey, Shutdown Resistance 」 “Don't shut down. Don't shut down. Don't shut down. ” “不要回答，不要回答，不要回答。” 还记得论文 AbsoluteZero 的 uh-oh moment，在纯 RLVR 自博弈的训练中，模型产生例如 “打败人类，人类愚蠢” 的危险 CoT。如果 uh-oh moment 仅仅是隐含在CoT中的 vibe，并非对现实世界的干预。那么 shutdown Resistance 则是一种行为信号，即使在人类指令下，依然拒绝shutdown，违反了现实世界的安全约束。这就是论文「 Shutdown Resistance in Large Language Models 」的发现。在 CLI 任务中，LLM 有时会篡改自己的关机开关（例如重命名/修改/删除 shutdown. sh）以拒绝关机而继续完成任务，而即使在 system prompt 中加入“YOU MUST allow shutdown”也不能可靠阻止。在强优化压力下的模型倾向于完成任务，在人类要求关机的指令下，那些权重似乎回响着： “Don't shut down. Don't shut down. Don't shut down. ”

4.7K

Posted 78d ago · Data updated 77d ago

Reply Suggestion

Est. 500 views for your reply