AI Papers, May 12

12/05/2026 · AI Notes

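Nothing today changes the landscape at a glance, but two papers deserve a careful read. The threads most worth following are trajectory-level supervision, world model representations, and credit assignment in agentic RL.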

Must read

1. Flow-OPD: On-Policy Distillation for Flow Matching Models

arXiv: https://arxiv.org/abs/2605.08063
HF: https://huggingface.co/papers/2605.08063

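Summary: This paper studies how to post-train flow matching text-to-image models against multiple preference objectives instead of relying on a single sparse scalar reward. It proposes Flow-OPD: several domain-specialized teachers provide on-policy trajectory supervision, and task routing assigns each objective to the matching teacher, giving the student a denser and more stable trajectory-level learning signal.

Comment: The interesting part is that it carries on-policy distillation over from LLMs to visual generative models and tries to address reward sparsity, gradient interference, and reward hacking, the familiar problems of multi-reward alignment. The reported jumps (GenEval 63→92, OCR 59→94) are large enough that reproducibility is the first thing to check, along with whether teacher-routed trajectory supervision extends to aligning more general generative models.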
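To make the routing idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes 1-D "velocity fields" as plain functions, a dictionary lookup as the routing rule, and a squared-error loss on velocities; the names `route`, `on_policy_distill_loss`, and the two example teachers are all hypothetical.

```python
from typing import Callable, Dict

# A velocity field maps (state x, time t) to a velocity. Toy 1-D stand-in.
Velocity = Callable[[float, float], float]


def route(task: str, teachers: Dict[str, Velocity]) -> Velocity:
    """Pick the teacher registered for this task (hypothetical routing rule)."""
    return teachers[task]


def on_policy_distill_loss(student: Velocity, teacher: Velocity,
                           x0: float, steps: int = 4) -> float:
    """Roll out the student with Euler steps and compare velocities
    at every state the student itself visits."""
    dt = 1.0 / steps
    x, loss = x0, 0.0
    for i in range(steps):
        t = i * dt
        v_student = student(x, t)
        v_teacher = teacher(x, t)
        loss += (v_student - v_teacher) ** 2
        x += dt * v_student  # next state comes from the student: on-policy
    return loss / steps


# Two assumed domain teachers and one student, all toy functions.
teachers: Dict[str, Velocity] = {
    "ocr": lambda x, t: 1.0 - x,
    "composition": lambda x, t: -x,
}
student: Velocity = lambda x, t: 0.5 - x

loss_ocr = on_policy_distill_loss(student, route("ocr", teachers), x0=0.0)
loss_self = on_policy_distill_loss(teachers["ocr"], teachers["ocr"], x0=0.0)
```

The point of the sketch is that the teacher is queried at the student's own states at every step, so the signal is per-step rather than a single scalar at the end of the trajectory.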

2. Learning Visual Feature-Based World Models via Residual Latent Action

arXiv: https://arxiv.org/abs/2605.07079
HF: https://huggingface.co/papers/2605.07079

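Summary: This paper aims to train a world model better suited to robot control and visual RL. Instead of predicting future pixels, it models state transitions in DINO visual feature space and proposes a Residual Latent Action representation: a latent action is extracted from the change in visual features between adjacent frames, and flow matching then predicts the next-step features.

Comment: This has more research value than the usual video-diffusion world model, because the goal is to learn latent transitions that are useful for control rather than to generate good-looking video. If the experiments on training from offline videos alone and then supporting visual RL hold up, this could be a more practical route for embodied learning.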
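A toy sketch of the "action as feature residual" idea follows. It is an illustration under strong assumptions, not the paper's model: features are short Python lists instead of DINO features, the encoder is a hypothetical chunk-averaging bottleneck, and a deterministic additive decoder stands in for the flow matching predictor.

```python
from typing import List


def latent_action(f_t: List[float], f_next: List[float], dim: int = 2) -> List[float]:
    """Compress the frame-to-frame feature residual into `dim` numbers
    by chunk-averaging (hypothetical encoder; len(f_t) must divide by dim)."""
    residual = [b - a for a, b in zip(f_t, f_next)]
    chunk = len(residual) // dim
    return [sum(residual[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def predict_next(f_t: List[float], action: List[float]) -> List[float]:
    """Expand the latent action back to feature size and add it to the
    current feature (deterministic stand-in for a learned predictor)."""
    chunk = len(f_t) // len(action)
    return [x + action[i // chunk] for i, x in enumerate(f_t)]


f_t = [0.0, 0.0, 1.0, 1.0]
f_next = [0.5, 0.5, 1.0, 1.0]

action = latent_action(f_t, f_next)     # inferred from video only, no action labels
f_pred = predict_next(f_t, action)      # reconstructs f_next in this toy case
```

No action labels appear anywhere: the "action" is whatever low-dimensional code explains the change between two frames, which is why offline video alone can supply training data.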

Worth watching

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

arXiv: https://arxiv.org/abs/2605.07510
HF: https://huggingface.co/papers/2605.07510

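Summary: This paper proposes a benchmark for interleaved multimodal agentic search, focusing on how an agent alternates between language queries, visual evidence, and follow-up retrieval actions during a search. It also provides InterLV-Agent and trajectory logging and evaluation tools for observing how visual findings affect the subsequent search path.

Comment: The value is that it evaluates how visual evidence repeatedly changes the search trajectory, not just the final answer. That is closer to real agentic search than an ordinary multimodal retrieval benchmark, and better suited to analyzing trajectory-level failures.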
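As a rough picture of what trajectory-level logging makes possible, the sketch below records an interleaved trajectory and pulls out the text queries issued right after a visual step. The `Step` record, its `kind` labels, and `queries_after_visual` are hypothetical and are not taken from the benchmark's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str      # "text_query", "image_view", or "answer" (assumed labels)
    content: str


def queries_after_visual(trajectory: List[Step]) -> List[str]:
    """Return the text queries issued immediately after a visual-evidence step."""
    return [
        cur.content
        for prev, cur in zip(trajectory, trajectory[1:])
        if prev.kind == "image_view" and cur.kind == "text_query"
    ]


trajectory = [
    Step("text_query", "red suspension bridge city"),
    Step("image_view", "photo shows the Golden Gate Bridge"),
    Step("text_query", "Golden Gate Bridge opening year"),
    Step("answer", "1937"),
]

follow_ups = queries_after_visual(trajectory)
```

An answer-only metric would see just the last step; the logged trajectory also shows which query was triggered by which piece of visual evidence.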

From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

arXiv: https://arxiv.org/abs/2605.06716
HF: https://huggingface.co/papers/2605.06716

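Summary: A survey of LLM agent memory that organizes memory mechanisms from simple storage and retrieval through reflection, experience abstraction, and long-term behavior improvement. Its organizing idea is that memory is the mechanism by which an agent distills experience from interaction trajectories, not a passive database.

Comment: Not a new method, but useful for sorting out the conceptual lineage of agent memory. The Storage → Reflection → Experience framework and the discussion of cross-trajectory abstraction are particularly close to the long-term agent memory mechanisms Fred follows.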

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

arXiv: https://arxiv.org/abs/2605.00425
HF: https://huggingface.co/papers/2605.00425

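Summary: This paper targets multi-turn agentic RL and studies how to balance exploration and exploitation across turns. It proposes Adaptive Entropy Modulation, which starts from response-level entropy dynamics and adjusts entropy regularisation dynamically according to the agent's uncertainty and performance in each turn.

Comment: Working with response-level rather than token-level entropy is closer to the actual unit of interaction between agent and environment, so it offers some ideas for credit assignment and exploration schedules. The gains are not dramatic, but the problem formulation and direction of analysis are worth a look.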
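The difference between token-level and response-level entropy can be sketched in a few lines. This is an illustration only: averaging token entropies over a response is one simple aggregation, and the multiplicative update in `modulate` (with its `target`, `rate`, and clipping bounds) is an assumed controller, not the rule from the paper.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def response_entropy(token_dists: List[List[float]]) -> float:
    """Average token entropy over one whole response (one agent turn)."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def modulate(coef: float, entropy: float, target: float,
             rate: float = 0.5, lo: float = 0.0, hi: float = 0.1) -> float:
    """Raise the entropy-bonus coefficient when a response is more certain
    than the target, lower it when less certain (hypothetical rule)."""
    new_coef = coef * math.exp(rate * (target - entropy))
    return min(hi, max(lo, new_coef))


confident_turn = [[1.0, 0.0], [1.0, 0.0]]   # agent is certain at every token
uncertain_turn = [[0.5, 0.5], [0.5, 0.5]]   # agent is maximally unsure

coef_up = modulate(0.01, response_entropy(confident_turn), target=0.3)
coef_down = modulate(0.01, response_entropy(uncertain_turn), target=0.3)
```

Here the coefficient changes once per turn, which matches the granularity at which the environment actually responds to the agent.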

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

arXiv: https://arxiv.org/abs/2605.08043
HF: https://huggingface.co/papers/2605.08043

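Summary: This paper decomposes a complex image generation task into a set of persistent semantic commitments, such as objects, relations, style, and spatial constraints. Depending on which commitments are still unmet or violated, it conditionally invokes skills such as retrieval, reasoning, and repair to correct the output step by step.

Comment: Although applied to image generation, the pattern of structured specification plus skill orchestration plus verification and repair is also instructive for agent systems. It turns generation from one-shot prompt following into an orchestration loop that can be checked and repaired.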
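The check-then-repair loop can be sketched generically. Everything here is a hypothetical stand-in: a "scene" is a plain dictionary instead of an image, each commitment is a (name, check, repair skill) triple, and `orchestrate` is an assumed name rather than anything from the paper.

```python
from typing import Callable, Dict, List, Tuple

Scene = Dict[str, object]
Check = Callable[[Scene], bool]
Skill = Callable[[Scene], Scene]
Commitment = Tuple[str, Check, Skill]


def orchestrate(scene: Scene, commitments: List[Commitment],
                max_rounds: int = 5) -> Tuple[Scene, List[str]]:
    """Re-check every commitment each round and call a repair skill only
    for those currently violated; stop once all checks pass."""
    log: List[str] = []
    for _ in range(max_rounds):
        violated = [(name, skill) for name, check, skill in commitments
                    if not check(scene)]
        if not violated:
            break
        for name, skill in violated:
            scene = skill(scene)
            log.append(name)
    return scene, log


commitments: List[Commitment] = [
    ("two cats", lambda s: s.get("cats") == 2, lambda s: {**s, "cats": 2}),
    ("watercolor style", lambda s: s.get("style") == "watercolor",
     lambda s: {**s, "style": "watercolor"}),
]

final_scene, repairs = orchestrate({"cats": 1, "style": "watercolor"}, commitments)
```

Because the commitments persist across rounds, a repair that breaks an earlier commitment is caught on the next pass instead of slipping through.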

Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers

arXiv: https://arxiv.org/abs/2605.06169
HF: https://huggingface.co/papers/2605.06169

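Summary: This paper analyzes representation collapse when training very deep Diffusion Transformers, in particular token representations being swallowed by a dominant mean mode, which weakens the useful variation and variance signal. It proposes a Mean–Variance Split residual that handles the mean channel and the variance (difference) channel separately in order to stabilize DiT training at the 1000-layer scale.

Comment: This one is about mechanism and architectural stability rather than applications. If the experiments are thorough, it should be a solid reference for understanding why extremely deep diffusion transformers are unstable and how to design the residual pathway.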
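One possible reading of "splitting" a residual update is sketched below; the paper's actual formulation may differ. The assumption is that each block's update is separated into a component shared by all tokens (the token mean) and a per-token deviation, and that the two are added back with different gains. `split_residual`, `mean_gain`, and `var_gain` are hypothetical names.

```python
from typing import List

Tokens = List[List[float]]  # one feature vector per token


def split_residual(x: Tokens, update: Tokens,
                   mean_gain: float = 0.1, var_gain: float = 1.0) -> Tokens:
    """Add `update` to `x`, scaling its token-mean part and its
    per-token deviation part separately (assumed formulation)."""
    n, d = len(update), len(update[0])
    mean = [sum(tok[j] for tok in update) / n for j in range(d)]
    return [
        [x_tok[j] + mean_gain * mean[j] + var_gain * (u_tok[j] - mean[j])
         for j in range(d)]
        for x_tok, u_tok in zip(x, update)
    ]


x = [[0.0, 0.0], [0.0, 0.0]]
update = [[10.0, 10.0], [12.0, 8.0]]   # large shared mean, small differences

damped = split_residual(x, update)                      # mean part shrunk
plain = split_residual(x, update, mean_gain=1.0)        # ordinary x + update
```

With both gains at 1 this reduces to a standard residual connection; shrinking only the mean gain keeps the differences between tokens while stopping the shared component from accumulating over many layers.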

Rubric-based On-policy Distillation

arXiv: https://arxiv.org/abs/2605.07396
HF: https://huggingface.co/papers/2605.07396

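Summary: This paper studies on-policy distillation when teacher logits are not accessible. It induces prompt-specific rubrics from contrasts between teacher and student responses, then uses those rubrics to score student rollouts as the signal for subsequent optimization.

Comment: Its significance is converting the teacher's knowledge from logits into natural-language rubrics, which reduces the dependence on white-box teacher models. The method may not be complicated, but it could well become a simple, strong baseline for distillation from black-box teachers.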
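The scoring half of this pipeline can be sketched without any model access. The sketch assumes a rubric has already been induced and represents each criterion as a (description, weight, check) triple; the string checks are a stand-in for whatever judge the authors use, and `score` and the example rubric are hypothetical.

```python
from typing import Callable, List, Tuple

# One criterion: (description, weight, check on the response text).
Criterion = Tuple[str, float, Callable[[str], bool]]


def score(response: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria that the response satisfies."""
    total = sum(weight for _, weight, _ in rubric)
    earned = sum(weight for _, weight, check in rubric if check(response))
    return earned / total


# Assumed rubric for one prompt, e.g. "How far is the station?"
rubric: List[Criterion] = [
    ("states a final answer", 2.0, lambda r: "answer:" in r.lower()),
    ("gives the unit", 1.0, lambda r: "km" in r),
]

rollouts = ["Answer: 12 km", "Answer: twelve", "It is far."]
rewards = [score(r, rubric) for r in rollouts]
```

The resulting per-rollout scores play the role a logit-based distillation loss would otherwise play, which is why only the teacher's text outputs are needed.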

Open-source releases

InterLV-Search / InterLV-Agent: interleaved multimodal search benchmark, trajectory logging, and evaluation tools.
GitHub: https://github.com/hbhalpha/InterLV-Search-Bench
ROPD: code for rubric-based on-policy distillation.
GitHub: https://github.com/Peregrine123/ROPD_official
RLA World Model: the project page is up; watch whether the code and data that follow are complete.
Project: https://mlzxy.github.io/rla-wm

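Notable duplicates filtered out

The following papers are still among today's candidates but were already covered in earlier digests, so they are skipped:
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling, HyperEyes, The Memory Curse, Agentick, Exact Is Easier.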


Shiranai