Millions of SAE Features: Can They Actually Be Used? (2026)

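Episode guide

This episode is a close reading of arXiv:2606.26620, "Discovering Millions of Interpretable Features with Sparse Autoencoders". The paper releases Qwen3-Instruct SAE, which packages sparse autoencoder (SAE) resources for the Qwen3 instruction-tuned model family as reusable research infrastructure.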

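In this episode

Why a large-scale SAE release for Qwen3-1.7B, 4B, and 8B is not "yet another feature case study" for mechanistic interpretability, but a shared coordinate system that makes results comparable.
Why the SAEs at three sites (residual stream, MLP output, attention output) need to be examined separately, and why a high FVE (fraction of variance explained) does not mean that model behavior is recovered well.
The trade-offs between 16K and 65K dictionaries at different L0 sparsity levels: more features can give finer granularity, but can also worsen feature splitting and absorption.
Why the refusal steering experiment demonstrates the value of features for causal intervention, yet should not be read as a "safety button".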
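
For listeners who want the two metrics named above in concrete form, here is a minimal sketch in plain Python. It assumes the common definitions (FVE as one minus squared reconstruction error over total variance around the mean, and L0 as the average number of nonzero features per token). The paper's exact definitions may differ, and the function names and toy data are hypothetical.

```python
from typing import List

Vector = List[float]


def fraction_of_variance_explained(x: List[Vector], x_hat: List[Vector]) -> float:
    """FVE = 1 - (squared reconstruction error) / (total variance around the mean).

    Assumed definition; x holds original activations, x_hat the SAE reconstructions.
    """
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    residual = sum((a - b) ** 2 for row, rec in zip(x, x_hat) for a, b in zip(row, rec))
    total = sum((a - m) ** 2 for row in x for a, m in zip(row, mean))
    return 1.0 - residual / total


def mean_l0(feature_acts: List[Vector]) -> float:
    """Average number of nonzero SAE feature activations per token."""
    return sum(sum(1 for a in row if a != 0.0) for row in feature_acts) / len(feature_acts)


# Toy data (hypothetical): 4 tokens, 2-dimensional activations.
acts = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
recon = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 3.0]]

# Toy SAE feature activations: 4 tokens, dictionary of 4 features.
feats = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 1.2, 0.3, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]

fve = fraction_of_variance_explained(acts, recon)
l0 = mean_l0(feats)
```

FVE only scores squared error in activation space. A reconstruction can score well here and still change the model's outputs if the error falls on directions that later layers are sensitive to, which is one reason behavior-level checks are usually reported alongside it.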

Source

He et al. 2026, Discovering Millions of Interpretable Features with Sparse Autoencoders

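Audio notes

This episode is a two-host technical deep-dive podcast in Chinese, about 8 minutes long. Chapters, a timeline, and a sentence-by-sentence transcript are released alongside the audio.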
