Speaker 1
The following is a conversation all about the state-of-the-art in artificial intelligence, including some of the exciting technical breakthroughs and developments in A_I_ that happened over the past year, and some of the interesting things we think might happen this upcoming year.
At times it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great honor and pleasure to be able to do this kind of episode with two of my favourite people in the A_I_ community, Sebastian Raschka and Nathan Lambert.
They are both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers, and X posters. Sebastian is the author of two books I highly recommend for beginners and experts alike: Build a Large Language Model From Scratch and Build a Reasoning Model From Scratch.
I truly believe in the machine learning computer science world the best way to learn and understand something is to build it yourself from scratch.
Nathan is the post-training lead at the Allen Institute for A_I_ and author of the definitive book on reinforcement learning from human feedback. Both of them have great X_ accounts and great Substacks. Sebastian has courses on YouTube, Nathan has a podcast, and everyone should absolutely follow all of those.
And now a quick few-second mention of each sponsor. Check them out in the description or at lexfridman.com/sponsors. It is in fact the best way to support this podcast. We got a bunch of great sponsors: Box for intelligent content management, Quo for your phone system, like calls, texts, contacts for your business, Uplift Desk, the desk I'm sitting behind and my favourite office desk, Fin for customer service A_I_ agents, Shopify for selling stuff online, CodeRabbit for A_I_ powered code review, Element for electrolytes, and of course our long-time friend Perplexity, for curiosity-driven knowledge exploration. Choose wisely, my friends. And now on to the full ad reads. I try to make 'em interesting, but if you do skip, please still check out the sponsors. I enjoy their stuff. Maybe you will too. To get in touch with me, for whatever reason, go to lexfridman.com/contact. If you can't tell, I'm trying to have a bit of a pep in my step at the moment, because I had a long night and didn't get much sleep at all, so I am running on fumes: delirious, happy, unsure of what is reality and what is a dream. In fact, we could right now be living inside of a dream. I have been going through a lot. I have been working insane hours, so much going on, I am so overwhelmed. Of course, as always, I am truly grateful and happy to be alive, but I have not been able to publish as many episodes as I would like, so there's a bunch of sponsors we'll have to catch up on. Your support truly means the world. Please check out all the sponsors, and if you think it might be useful to you, buy their stuff. It really is the best way to support this podcast. Alright, let's go. First up, this episode was brought to you by Box, a cloud-
based platform for content management, file sharing, and all kinds of collaboration around content for your business. Like with a lot of companies, the big question is: how is A_I_ leveraged to make whatever the business does better? A lot of companies kinda use it for the hype and the label. It's kinda hilarious to watch people just say, like, powered by A_I_. I don't care if you're a bakery powered by A_I_. I don't know. But outside of all the hype, it is one of the most incredible things that humans have ever created. And so companies that can leverage it well are the companies that win. And of course, Box is legendary for its file and content management, especially when you're talking about scale. So obviously it's amenable to the utilization of A_I_ to help automate some of the documents, some of the workflows, some of the organisation, and they do that exceptionally well. They have a system called, as you could imagine, Box A_I_ that does just that. I love it: an excellent implementation on the interface side, on the back-end side, everything works extremely nicely. Help scale A_I_ across your organisation today and go to box.com/A_I_. That's box.com/A_I_ to learn more.
This episode is also brought to you by Quo, spelled Q_ U_ O_. It also happens to be a company name with just three letters that will help you win at Scrabble. Are you allowed to use company names in Scrabble? How many points is Q_? How many points is U_? I'm imagining a lot. That was one of the big confusions to me when I was first learning the English language. It always felt like Q_ should be at the end of the alphabet, maybe like Q_Z_. It always came as a surprise, to my limited brain capacity, that Q_ was earlier on in the alphabet. What is it, O_P_Q_? I can't even actually localize letters in the alphabet, and I'm sure that's the case for a lot of people, without reading the alphabet in my head sequentially. All of this has to do with short-term and long-term memory access, the functioning and limitations of human cognition, and maybe cognitive systems in general.
All of it is relevant to this particular episode, and not so relevant to the awesomeness of Quo, formerly known as OpenPhone, that I should be talking about. Of course, as is always the case, I think the point here, and the point everywhere, and the point of life, is to talk from the heart about whatever you want, and that's what I try to do with everything. And to generalize that even more: to talk whenever I want, and to shut the F_ up whenever I want and listen. And I prefer that more often than I prefer to talk. Insert clever transition here, because talk is somehow relevant. It is. So Quo, formerly known as OpenPhone, helps over ninety thousand businesses manage phone calls, texts, contacts, all kinds of phone-related stuff for business. You have a bunch of incoming customer calls, a bunch of people on the business side that have to answer those calls, have to manage them: what's the status of this particular request, voicemails, transcripts, all that kind of stuff. And obviously, a really nice, effective utilization of A_I_ to make that really efficient. But what's really important for things like this is that the interface is good, that team collaboration is good, and Quo delivers on that. Try Quo for free, plus get twenty percent off your first six months, when you go to quo.com/lex. That's Q_ U_ O_ dot com slash Lex. Tell your friends about it, because it just might help 'em win at Scrabble. Speaking of Scrabble, you usually wanna play Scrabble at a table. It's such a magical experience. I just had a vision from a distant past of me sitting with a friend and playing Scrabble at a table. What is this life? Full of beautiful memories. And then it's over too soon.
Yeah. That melancholy feeling is beautiful, I think. Insert another clever transition, à la Mark Normand maybe, because the name of this next company is Uplift Desk. As I said, it's my go-to favourite office desk, and it's also the desk that I use for podcast furniture. I already lost count: I have a lot of Uplift desks, standing desks, in my place everywhere. It's desks everywhere. I have a mattress on the floor and Uplift desks. I have a Linux box for robotics. I have a machine where I do a lot of the editing. All of that is on a desk. I have the three tables for the podcast desk, the very one you've seen over the past several years. That's all Uplift desks. I usually don't put them in standing mode, but they are desks that allow me to do all kinds of stuff: really easy to work with, really nice material, really sturdy. I just love everything about Uplift desks. When they said they wanna sponsor, after I've been using 'em for many years, I lost my mind. I love it when I've been in love with a company, in love with their product, for such a long time, and I get to also sing their praises. I mean, come on, what are you gonna tell me next, that FFmpeg wants to sponsor this podcast? Another sort of open source project, not a company, that I've been in love with. Anyway, go to upliftdesk.com/lex and use code Lex to get four free accessories, free same-day shipping, free returns, a fifteen-year warranty, and an extra discount off your entire order. That's U_P_L_I_F_T_D_E_S_K_ dot com slash Lex. Does spelling it out really help anybody? I don't know, but they really said pretty please; the one request was: spell it out. Again, what is this life? Incredible. This episode is also brought to you by Fin, the number one A_I_ agent for customer service. Find the niche and become number one. That's the idea here. For anybody building an A_I_ company, and we talk about this, is the dream of A_G_I_ dead? I think for a lot of companies, success is in the niche.
But there are a few, and Fin delivers on that niche. It's trusted by over six thousand customer service leaders at top companies, including A_I_ companies. When an A_I_ company trusts your company to do its customer service, that means you're legit. Ninety-day money-back guarantee, up to one million dollars, built to handle complex multi-step queries like returns, exchanges, and disputes. Go to fin.ai/lex to learn more about transforming your customer service and scaling your support team. That's fin.ai/lex. I don't know why I switched to this hyping voice. Crappy announcer, crappy radio jockey, crappy ad-read voice. It is what it is. Thank you for sticking with me this long. I feel the love and I send it right back at you. This episode
is also brought to you by a company whose engineers are also full of love: Shopify. It just brings a smile to my face. Every time I think about Shopify, I remember getting to see their engineering booth at NeurIPS, which is a machine learning conference. Really brilliant people, wonderful people. Of course the CEO, Toby, is still programming, still building stuff, still in on the details of the engineering, and now is talking quite a bit about the utilization of L_L_M_s for his own sort of pet projects, but also inside the company. It's just incredible when, from the very top, the company is in love with engineering. It's a celebration of great engineering. Just like the conversation with D_H_H_, the guy behind Ruby on Rails, which Shopify was built on; that conversation was a celebration of great engineering, and the beauty of engineering as well. Anyway, listen to that episode to see some of the magic of Ruby on Rails and the magic of Shopify and the magic of Toby that we talk about. Anyway, sign up for a one-dollar-per-month trial period at shopify.com/lex. That's all lowercase. Go to shopify.com/lex to take your business to the next level today.
This episode is also brought to you by CodeRabbit, a platform that provides A_I_ powered code reviews directly within your terminal.
We talk a lot in this episode about the timeline for the full automation of the human programmer. I think we're quite far away from taking the human out of the loop. The review process, the debugging process, all of that is such a crucial part of programming, especially when, just like we talk about in the episode, we're not talking about a personal website, where H_T_M_L_ slop is something that a web browser magically, automagically, I don't know how they're possibly able to do such an incredible job of rendering slop, but a web browser is in fact able to render slop, including A_I_ slop. It just finds a way. So really the question is: when you have production code, something that a lot of users are relying on, how do you review that code? How do you make sure you're catching the errors? How are you making sure that you put a backstop to the hallucinations and the logical errors that A_I_ coding agents can generate? Anyway, CodeRabbit supports all programming languages. Install the CodeRabbit C_L_I_ today at coderabbit.ai/lex. That's coderabbit.ai/lex. This episode is also brought to you by Element, my daily zero-sugar and delicious electrolyte mix.
It reminds me of the fact that I need to get to editing the video of me in the jungle with Paul Rosolie, who is such an incredible human. Congratulations to Paul on all of his success. Go get his book. It's an incredible book. Again, he's an incredible person with an incredible mission. And yes, I need to edit and publish, hoping to at the very least tell the story of our journey in the jungle, because it was a beautiful celebration of nature and the jungle and friendship and the full richness of the human experience. It was beautiful. The reason I mention that is, at one point during that journey I was severely dehydrated, and I remember dreaming of Element, of a cold drink of water with the electrolytes. Your body craves it, and it craves it because it needs it: sodium, potassium, magnesium. When you're deprived, it's not just water you need, it's electrolytes. So anyway, I always remember that. Get a free eight-count sample pack with any purchase. Try it at drinkLMNT.com/lex.
This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on.
And now, dear friends, here's Sebastian Raschka and Nathan Lambert.
So I think one useful lens to look at all of this through is the so-called DeepSeek moment. This happened about a year ago, in January twenty twenty five, when the open-weight Chinese company DeepSeek released DeepSeek R_ one, which, I think it's fair to say, surprised everyone with near or at state-of-the-art performance with allegedly much less compute, far cheaper. And from then to today, the A_I_ competition has gotten insane, both on the research level and the product level; it's just been accelerating. Let's discuss all this today, and maybe let's start with some spicy questions if we can. Who is winning at the international level? Would you say it's the set of companies in China or the set of companies in the United States? And Sebastian, Nathan, it's good to see you guys. So Sebastian, who do you think is winning?
So winning is a very broad, you know, term. You mentioned the DeepSeek moment, and I do think DeepSeek is definitely winning the hearts of the people who work on open-weight models, because they share these as open models. Winning, I think, has multiple time scales to it: we have today, we have next year, we have in ten years. One thing I know for sure is that I don't think nowadays, in twenty twenty six, there will be any company that has access to a technology that no other company has access to. And that is mainly because researchers are frequently changing jobs, changing labs; they rotate. So I don't think there will be a clear winner in terms of technology access. However, I do think the differentiating factor will be budget and hardware constraints. So I don't think the ideas will be proprietary, but rather the resources that are needed to implement them. So I don't currently see a winner-take-all scenario. I can't see that at the moment.
Uh Nathan, what do you think?
You see the labs put different energy into what they're trying to do. And to demarcate the point in time when we're recording this: the hype over Anthropic's Claude Opus four point five model has been absolutely insane. I've used it and built stuff with it in the last few weeks, and it's almost gotten to the point where it feels like a bit of a meme in terms of the hype, and it's kind of funny because this is very organic. And then if we go back a few weeks before that, for the release dates and the notes, Gemini three from Google got released, and it seemed like the marketing and just the wow factor of that release was super high. But then at the end of November, Claude Opus four point five was released and the hype has been growing, while Gemini three came before it, and it kind of feels like people don't really talk about it as much, even though when it came out everybody was saying this is Gemini's moment to retake Google's structural advantages in A_I_. And Gemini three is a fantastic model, and I still use it; it's just that the differentiation is lower. And I agree with Sebastian, with what you're saying: the idea space is very fluid. But culturally, Anthropic is known for betting very hard on code, and the Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and the culture of organisations, where Anthropic seems to at least be presenting as the least chaotic. That's a bit of an advantage if they can keep doing that for a while. But on the other side of things, there's a lot of ominous
technology from China, where there are way more labs than DeepSeek. So DeepSeek kicked off a movement within China, I'd say kind of similar to how chat G_P_T_ kicked off a movement in the U_S_ where everything had a chatbot. There are now tons of tech companies in China that are releasing very strong frontier open-weight models, to the point where I would say that DeepSeek is kind of losing its crown as the preeminent open model maker in China, and the likes of Z_ dot A_I_ with their G_L_M_ models, MiniMax's models, and Kimi's models have, especially in the last few months, shone more brightly. The new DeepSeek models are still very strong, but it could be looked back on as a big narrative point, where in twenty twenty five DeepSeek came and kind of provided this platform for way more Chinese companies to release these fantastic models and have this new type of operation. So these models from these Chinese companies are open-weight, and depending on the trajectory of the business models that these American companies are pursuing, those could be at risk. But currently, a lot of people are paying for A_I_ software in the U_S_, and historically, in China and other parts of the world, people don't pay a lot for software.
So some of these models, like DeepSeek, have the love of the people because they are open-weight. How long do you think the Chinese companies will keep releasing open-weight models?
I would say for a few years. I think that, like in the U_S_, there's not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it, so I get inbound from some of them. They're smart and realize the same constraints, which is that a lot of U_S_ tech companies and other I_T_ companies won't pay for an A_P_I_ subscription to Chinese companies due to security concerns. This has been a long-standing habit in tech, and the people at these companies then see open-weight models as an ability to influence and take a share of a huge, growing A_I_ expenditure market in the U_S_. And they're very realistic about this. And it's working for them, and I think the government will see that it is building a lot of influence internationally in terms of uptake of the technology. So there are going to be a lot of incentives to keep it going. But building these models and doing the research is very expensive, so at some point I expect consolidation. I don't expect that to be a story of twenty twenty six, though: there will be more open model builders throughout twenty twenty six than there were in twenty twenty five, and a lot of the notable ones will be in China.
You were gonna say something?
Yes, you mentioned DeepSeek losing its crown. I do think so to some extent, but we also have to consider that they are still, I would say, slightly ahead. It's not that DeepSeek got worse; it's just that the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi: the same architecture, and they're training it. And then again we have this leapfrogging, where one might at some point in time be a bit better because they have the more recent model. And I think this comes back to the fact that there won't be a clear winner, and it will just be like that: one releases something, the other one comes in, and the most recent model is probably always the best model.
Yeah. We'll also see that Chinese companies have different incentives. DeepSeek is very secretive, whereas some of these startups, like the MiniMaxes and Z_ dot A_I_s of the world, those two have literally filed I_P_O_ paperwork, and they're trying to get Western mind-share and do a lot of outreach there. So I don't know if these incentives will kind of change the model development, 'cause DeepSeek famously is built by a hedge fund, High-Flyer Capital, and we don't know what they use the models for or if they care about this. In terms of communication, they're not secretive about the technical reports that describe how their models work. They're still open on that front. And we should also say, on the Opus four point five hype, there's the layer of something being the darling of the X echo chamber, or Twitter echo chamber, versus the actual number of people that are using the model. I think it's probably fair to say that chat G_P_T_ and Gemini are focused on the broad user base that just wants to solve problems in their daily lives, and that user base is gigantic. So the hype about the coding may not be representative of the actual use.
I would say also, a lot of the usage patterns are, like you said, name recognition, brand, and stuff like that, but also muscle memory almost, where, you know, chat G_P_T_ has been around for a long time, people just got used to using it, and it's almost like a flywheel: they recommend it to other users. One interesting point is also the customisation of L_L_M_s. For example, chat G_P_T_ has a memory feature, right? And so you may have a subscription and you use it for personal stuff, but I don't know if you want to use that same thing at work, because there's a boundary between private and work. If you're working at a company, they might not allow that. Or you may not want that. And I think that's also an interesting point, where you might have multiple subscriptions. One is just clean: it has nothing of your personal images or hobby projects in there; it's just the work thing. And then the other one is your personal thing. So I think those are two different use cases, and it doesn't mean you only have to have one. I think the future is also multiple ones.
What model do you think won twenty twenty five? And what model do you think is gonna win twenty twenty six?
I think in the context of consumer chatbots, it's a question of: are you willing to bet on Gemini over chat G_P_T_? Which, I would say, in my gut feels like a bit of a risky bet, because open A_I_ has been the incumbent, and there are so many benefits to that in tech. I feel like in twenty twenty five the momentum was on Gemini's side, but they were starting from such a low point, I think, R_I_P_ Bard and those earlier attempts at getting started. I think huge credit to them for powering through the organisational chaos to make that happen. But also it's hard to bet against open A_I_, because they always come across as so chaotic, but they're very good at landing things. And personally, I have very mixed reviews of G_P_T_ five, but it had to have saved them so much money, with the headline feature being a router, where most users are no longer incurring as much G_P_U_ cost. So I think it's very hard to dissociate the things that I like out of models from the things that are gonna actually be a differentiator for the general public.
What do you think about twenty twenty six? Who's gonna win?
I'll say something, even though it's risky. I will say that I think Gemini will continue to make progress on chat G_P_T_. I think it comes down to Google's scale, when both of these are operating at such extreme scales, and Google has the ability to separate research and product a bit better, where you hear so much about open A_I_ being chaotic operationally and chasing the high-impact thing, which is a very start-up culture. And then on the software and enterprise side, I think Anthropic will continue to have success, as they've again and again been set up for that. Obviously Google's cloud has a lot of offerings, but I think this kind of Gemini name brand is important for them to build. And Google's cloud will continue to do well as well, but that's a more complex thing to explain in the ecosystem, because that's competing with the likes of Azure and A_W_S_ rather than on the model provider side.
So in the infrastructure you think T_P_U_s give an advantage?
Largely because the margin on NVIDIA chips is insane, and Google can develop everything from top to bottom to fit their stack and not have to pay that margin, and they've had a head start in building data centers. So for all of these things that have both high lead times and very hard margins at high cost, Google has just kind of a historical advantage there. And if there's gonna be a new paradigm, it's most likely to come from open A_I_, where their research division again and again has shown this ability to land a new research idea or a product. I think, like, deep research, Sora, O_ one, thinking models, all these definitional things have come from open A_I_, and that's gotta be one of their top traits as an organisation. So it's kind of hard to bet against that, but I think a lot of this year will be about scale and optimising what could be described as low-hanging fruit in models.
And clearly there's a trade-off between intelligence and speed. This was what chat G_P_T_ five was trying to solve behind the scenes. Do people, the broad public, actually want intelligence, or do they want speed?
I think it's a nice variety actually, or the option to have a toggle there. For my personal usage, most of the time when I look something up, I use chat G_P_T_ to ask a quick question and get the information I want fast. For most daily tasks I use the quick model. Nowadays I think the auto mode is pretty good, where you don't have to specifically say thinking or non-thinking and stuff. Then again, I also sometimes want the pro mode. Very often what I do when I have something written is put it into chat G_P_T_ and say, hey, do a thorough check: are all my references correct? Are all my thoughts correct? Did I make any formatting mistakes? Are the figure numbers wrong, or something like that? And I don't need that right away. It's something where, okay, I finish my stuff, maybe have dinner, let it run, come back, and it goes through this. And see, this is where I think it's important to have this option. I would go crazy if for each query I would have to wait thirty minutes, or even ten minutes. Yeah.
When people say they use the non-thinking model, I'm like, oh, how do you live with that? That's my reaction. I've been heavily on chat G_P_T_ for a while and never touched five non-thinking. I find its tone, and then its propensity for errors, it's just a high likelihood of errors. Some of this is from back when open A_I_ released O_ three, which was the first model to do this deep search and find many sources and integrate them for you. I became habituated to that, so I will only use G_P_T_ five point two thinking or pro when I'm running any sort of information query for work, whether that's a paper or some code reference that I found. I will regularly have like five pro queries going simultaneously, each looking for one specific paper or feedback on an equation or something.

I have a fun example where I just needed an answer as fast as possible, for this podcast, before I was going on a trip. I have a local G_P_U_ running at home, and I wanted to run a long R_L_ experiment. And usually I also unplug things, because you never know: if you're not at home, you don't wanna have things plugged in. And I accidentally unplugged the G_P_U_. My wife was already in the car, and it's like, oh dang, and then basically I wanted, as fast as possible, a bash script that runs my different experiments and evaluation. And I know, I learned how to use the bash terminal well, but in that moment it was just, in like ten seconds, give me the command.
This is a hilarious situation, but yes, what did you use?
So I used the non-thinking, fastest model. It gave me the bash command to chain the different scripts together, and then there's the tee thing, where you want to route the output to a log file. Off the top of my head, in a hurry, I could have thought about it myself.
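For reference, the kind of one-liner Sebastian describes, chaining experiment scripts and routing the combined output through tee into a log file, might look something like this sketch. The script names are placeholders, not his actual files; the echo lines stand in for real training and evaluation scripts.

```shell
#!/usr/bin/env bash
# Run several experiment steps in sequence, then evaluation, and mirror
# everything that is printed into a log file via tee.
set -euo pipefail

LOG="run.log"

{
  echo "running experiment 1"   # placeholder for e.g. `python train_run1.py`
  echo "running experiment 2"   # placeholder for e.g. `python train_run2.py`
  echo "running evaluation"     # placeholder for e.g. `python evaluate.py`
} 2>&1 | tee "$LOG"             # tee prints to the terminal AND writes the log
```

Plain `&&` chaining (`./a.sh && ./b.sh | tee run.log`) would also work, but there the pipe only captures the last command's output; grouping the commands with braces makes tee capture every step, stdout and stderr alike.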
By the way, I don't know if that's a representative case: wife waiting in the car, you have to run, you know, plug in the G_P_U_, you have to generate a bash script. This sounds like a movie. Like, you wish it were possible.
I use Gemini for that. So I use thinking for all the information stuff, and then Gemini for fast things, or stuff that I once would have Googled: it's good at explaining things, I trust that it has this kind of background knowledge, it's simple, and the Gemini app has gotten a lot better, so it's good for that sort of thing. And then for code and any sort of philosophical discussion, I use Claude Opus four point five, also always with extended thinking. Extended thinking and inference-time scaling are just a way to make the models marginally smarter, and I will always err on that side when the progress is very high, because you don't know when that'll unlock a new use case. And then I sometimes use Grok for real-time information, or finding something on A_I_ Twitter that I know I saw and need to dig up and have just fixated on. Although, when Grok four came out, Grok four Heavy, which was like their pro variant, was actually very good, and I was pretty impressed with it. But that was just kind of muscle memory; I lost track of it with having the chat G_P_T_ app open. So I use many different things.
Yeah, I actually do use Grok four Heavy for debugging, for like hardcore debugging when the other ones can't solve it. I find it's the best at that. And it's interesting, 'cause you say chat G_P_T_ is the best interface; for me, for that same reason, though this could be just momentum, Gemini is the better interface. I think because I fell in love with it being the best at needle-in-the-haystack: if I ever put in something that has a lot of context but can have very specific kinds of information, to make sure it tracks all of it, I find that Gemini, for me, has been the best. So it's funny with some of these models: if they win your heart over for one particular feature, on one particular day, for that particular query, that prompt, you're like, this model is better. And so you'll just stick with it for a bit until it does something really dumb. There's like a threshold effect: it does some smart thing and then you fall in love with it. And then it does some dumb thing, and you're like, you know what, I'm gonna switch, try Claude or chat G_P_T_, and all that kind of stuff.
this is exactly like you use it until it breaks until you have a problem and then then you ch uh change uh the L_M_ and I think i it's the same how we use anything like uh our favourite text editor um operating systems or m the browser I. mean there are so many browser options Safari, Firefox, Chrome, uh all the c relatively similar but then there are ex etch cases maybe extensions you wanna use and then you switch but, I don't think there is any w one who types the same thing like the website into different browsers and compares them, you only do
that when the website doesn't render, if something breaks, I think. So that's a good point: you use it until it breaks, and then you explore other options, I think.
On the long-context thing, I was also a Gemini user for this, but the G_P_T_ five point two release blog had like crazy long-context scores, where a lot of people were like, did they just figure out some algorithmic change? It went from like thirty percent to like seventy percent or something in this minor model update. So it's also very hard to keep track of all of these things, but now I look more favourably at G_P_T_ five point two's long context. So it's just kinda like, how do I actually get to testing this? It's a never-ending battle.
Well, it's interesting that none of us talked about the Chinese models from a usage perspective. What does that say? Does that mean the Chinese models are not as good, or does that mean we're just very biased uh and U_S_ focused?
I do think that that's currently the discrepancy between just the model and the platform. So I think the open models, they are more known for the open weights, not their platform yet.
Mm-hmm.
So these models from the U_S_ are better in terms of the outputs. I think the question is will they stay better for this year and for years going forward, but it's like, so long as they're better, I'm gonna pay to use them. I think there's also analysis that shows that like the
way that the Chinese models are served, which you could argue is due to export controls or not, is that they use fewer G_P_U_s per replica, which makes them slower and have different errors. And it's like speed and intelligence: if these things are in your favour as a user, I think in the U_S_ a lot of users will go for this, and I think that that is a good thing that will
spur these Chinese companies to want to compete in other ways, whether it's like free or substantially lower costs, or it'll breed creativity in terms of offerings, which is good for the ecosystem. But I just think of a simple thing: the U_S_ models are currently better and we use them. And I try these other open models and I'm like, fun, but I don't go back to it.
Uh we didn't really mention programming. That's another use case that a lot of people deeply care about. So I use basically half and half Cursor and Claude Code, because I find them to be like fundamentally different experiences and both useful. Uh what do you guys, you program quite a bit, so what do you use? What's the current vibe?
So I use the Codex plug-in for V_S_ Code. Uh you know, it's very convenient; it's just like a plug-in, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a bit more agentic: it touches more things, it does a whole project for you. I'm not quite there yet where I'm comfortable with that, because uh maybe I'm a control freak, but I still would like to see a bit of what's going on, and Codex is right now for me like the sweet spot, where it is helping me but it is not taking over completely.
I should mention, one of the reasons I do use Claude Code is to build the skill of programming with English. I mean, the experience is fundamentally different. As opposed to micromanaging the details of the process of the generation of the code, and uh looking at the diff, which you can in Cursor, if that's the I_D_E_ you use, and changing, altering, looking at and reading the code and understanding the code deeply as you progress, versus just kinda like
thinking in this design space and just guiding it at this uh macro level, which I think uh is another way of thinking about the programming process. Also, we should say that Claude Code just seems to be somehow a better utilisation of Claude Opus four point five.
It's a good side-by-side for people to do. You can have Claude Code open, you can have Cursor open, and you can have V_S_ Code open, and you can select the same models in all of them and ask questions that are very interesting. Like, Claude Code is way better in that domain; it's remarkable.
We should say that both of you are legit on multiple fronts: researchers, programmers, educators, tweeterers, and on the book front too. So Nathan, at some point soon hopefully, has an R_L_H_F_ book coming out.
It's available for pre-order, and there's a full digital pre-print; I'm just making it pretty and better organised for the physical thing, which is a lot of why I do it, because it's fun to create things that you think are excellent in physical form when so much of our life is digital.
I should say, going to Perplexity here: Sebastian Raschka is a machine learning researcher and author known for several influential books, a couple of which I wanted to mention and highly recommend: Build a Large Language Model From Scratch, and the new one, Build a Reasoning Model From Scratch. So I'm really excited about that. Building stuff from scratch is one of the most powerful ways of learning.
Honestly, building an L_L_M_ from scratch is a lot of fun. It's also a lot to learn, and like you said, it's probably the best way to learn how something really works, 'cause you can look at figures, but figures can have mistakes. Uh you can look at concept explanations, but you might misunderstand them. But if there is code and the code works, you know it's correct. Uh I mean, there's no misunderstanding; it's precise, otherwise it wouldn't work. And I think that's kind of like the beauty behind coding: it doesn't lie, it's math basically.
Even with math, though, I think you can have mistakes in a book that you would never notice, because you're not running the math when you are reading the book; you can't verify it. And with code, what's nice is you can verify it.
Yeah, I agree with you about the L_L_M_ From Scratch book. It's nice to tune out everything else, the internet and so on, and just focus on the book. But you know, I read several, like, you know, history books. It's just less lonely somehow. It's really more fun. Like, yeah, for example on the programming front, I think it's genuinely more fun to program with an L_L_M_, and I think it's genuinely more fun to read with an L_L_M_, but you're right that this distraction should be
minimised. So you use the L_L_M_ to basically enrich the experience, maybe add more context. The rate of aha moments for me on a small scale is really high with L_L_M_s.
Hundred percent. I also want to correct myself: I'm not suggesting not to use L_L_M_s. Uh I suggest doing it in multiple passes, like one pass just offline, focus mode, and then after that, uh I mean I also take notes, but I try to resist the urge to immediately look things up. I do a second pass; it's just more structured for me this way. I mean, sometimes things are answered in the chapter, but sometimes it also just helps to let it sink in and think about it. Other people have different preferences. I would highly
recommend using L_L_M_s when reading books. For me it's just not the first thing to do; it's the second pass.
By way of recommendation, as I say it, I do the opposite. I like to use the L_L_M_ at the beginning to lay out the full context of, like, what is this world that I'm now stepping into. But I try to avoid clicking out of the L_L_M_ into the world of, like, Twitter and blogs, because then you're down this rabbit hole: you're reading somebody's opinion, there's a flame war about a particular topic, and all of a sudden you're no longer, you're now in the realm of the internet and Reddit and so on.
If you're purely letting the L_L_M_ give you the context of why this matters, what the big-picture ideas are, uh but sometimes books themselves are good at doing that, but not always. So,
this is why I like the Chat G_P_T_ app, 'cause it gives the A_I_ a home in your computer where you can focus on it, rather than just being another tab in my mess of internet options. And I think Claude Code in particular does a good job of making that a joy, where it seems very engaging as a product, designed to be an interface where your A_I_ will then go out into the world. And something that is very intangible between it and Codex is that Claude Code just feels kind of warm and engaging, where Codex can often be as good
from Open A_I_, but it just kind of feels a little bit rougher on the edges, whereas Claude Code makes it fun to build things, particularly from scratch, where you just don't have to care, but you trust that it'll make something. Obviously this is good for websites and kind of refreshing tooling and stuff like this, which I'd use it for, or data analysis. So for my blog, we scrape Hugging Face, so we keep the download numbers for every data set and model over time now. So we have them, and Claude is just like, yeah, I've made use of that data, no problem. And all that would've
taken days, mostly. Then I have enough situational awareness to be like, okay, these trends obviously make sense, and you can check things. So it's just a kind of wonderful interface where you can have an intermediary and not have to do the kind of awful low-level work that you would have to do to maintain different web projects and do this stuff.
Alright, so we just talked about a bunch of the closed weight models. Let's talk about the open ones. Uh so tell me about the landscape of open weight L_L_M_s. Which are the interesting ones, which stand out to you, and why? We already mentioned DeepSeek.
Do you wanna see how many we can name off the top of our head?
Yeah yeah, without looking at notes.
DeepSeek, Kimi, MiniMax, Z_ dot A_I_, Ant's Ling. We're just going Chinese. Um let's go Mistral A_I_, Gemma, um G_P_T_ O_S_S_, the open-weight model by uh Open A_I_. Actually, NVIDIA had a very cool one, uh Nemotron three. Um there's a lot of stuff, uh especially at the end of the year. Qwen, maybe the most impressive. I was trying to see if you can get at least ten Chinese and at least ten Western. I think, I mean, Open A_I_ released their first open model since G_P_T_ two. When I was writing about Open A_I_'s open model release, they were all like, don't forget about G_P_T_ two, which I thought was really funny 'cause it's such a different time. But G_P_T_ O_S_S_ is actually a very strong model and does some things that the other models don't do very well. And I think that selfishly I'll promote a bunch of like Western companies, so both in the
U_S_ and in Europe, that have these like fully open models. So I work at the Allen Institute for A_I_. We've been building OLMo, which releases data and code and all of this. And now we have actual competition from people that are trying to release everything so that other people can train these models. So there's the Institute of Foundation Models, with L_L_M_ three sixty, which had their K_ two models of various types. Apertus is a Swiss research consortium effort. Hugging Face um has SmolLM, which is very popular. Um and NVIDIA's Nemotron has started
releasing data as well. And then there's Stanford's Marin community project, which is kind of making it so there's a pipeline for people to open a GitHub issue, implement a new idea, and then have it run in a stable language-modelling stack. So this space, that list was way smaller in twenty twenty four; I think it was like just A_I_ two. So that's a great thing for more people to get involved and to understand language models, and it doesn't really have a Chinese analogue. While I'm talking, I'll say that the Chinese open
language models tend to be much bigger, and that gives them this higher peak performance as M_O_E_s, where a lot of these things that we like a lot, whether it was Gemma um and Nemotron, have tended to be smaller models from the U_S_, which is starting to change for the U_S_ and Europe. Um Mistral Large three came out, which was a giant M_O_E_ model, very similar to the DeepSeek architecture, in December. And then a start-up, Arcee A_I_, and NVIDIA with Nemotron have teased M_O_E_ models way
bigger than a hundred billion parameters, like this four hundred billion parameter range, coming in this like Q_ one twenty twenty six timeline. So I think this kind of balance is set to change this year in terms of what people are using the Chinese versus U_S_ open models for, which is what I'm personally gonna be very excited to watch.
First of all, huge props for being able to name so many of these. Did you actually name Llama?
Um no.
Llama, R_I_P_.
Well, it's not on purpose. Alright, R_I_P_ Llama. Alright, can you mention what are some interesting models that stand out? So you mentioned Qwen three, that's obviously a standout.
So I would say the year is almost bookended by, on one hand, uh DeepSeek version three and R_ one, and then on the other hand, in December, uh DeepSeek version three point two, because what I like about those is they always have an interesting architecture tweak that others don't have. But otherwise, if you wanna go with, um, you know, the familiar but really good performance: uh Qwen three, and, like um Nathan said, also G_P_T_ O_S_S_. And I think what's interesting about G_P_T_ O_S_S_ is that it's kind of like the first public, or like open-weight, model that was really trained with
tool use in mind, which I do think is kind of a little bit of a paradigm shift, where the ecosystem was not quite ready for it. By tool use I mean that the L_L_M_ is able to do a web search, to call a Python interpreter. And I do think it's a standout because it's a huge unlock, because uh one of the most um common complaints about L_L_M_s is, for example, hallucinations, right. And so in my opinion, one of the best ways to solve uh hallucinations is to not always try to remember information or make things up. For
example, why not use a calculator app or Python? If I asked the L_L_M_ who won the, I don't know, soccer World Cup in nineteen ninety eight, uh instead of just trying to memorize it, it could go do a uh search. I think mostly it's usually still a Google search. So G_P_T_ O_S_S_ would do a tool call to Google, maybe find the FIFA website, find okay, it was France. It would get you that information reliably instead of just trying to memorize it. So I think it's a huge unlock, uh which I think right now is not fully
utilized by the open source, open weight ecosystem. A lot of people don't use tool-call modes because, I think, first it's a trust thing. You don't wanna run this on your computer where it has access to tools; it could wipe your hard drive or whatever. So you maybe wanna containerize that. Um but I do think, you know, that that is a really important step um for the upcoming years, to have this ability, you know.
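To make the tool-use loop described here concrete, below is a minimal sketch in Python. The tool name, the hard-coded "model", and the dispatch logic are all hypothetical stand-ins, not any real A_P_I_: real systems define their own tool-call schemas, and the L_L_M_ itself would emit the JSON that is hard-coded here.

```python
# A toy sketch of the tool-use loop: the model asks for a tool, the
# harness runs it, the result goes back to the model, which answers.
# Both `web_search` and `fake_model` are invented for illustration.

def web_search(query):
    # Stand-in for a real search A_P_I_ call.
    facts = {"1998 soccer World Cup winner": "France"}
    return facts.get(query, "no result")

TOOLS = {"web_search": web_search}

def fake_model(prompt, observation=None):
    # A real L_L_M_ would generate these structured steps itself; here
    # we hard-code two turns: request a tool, then answer from its result.
    if observation is None:
        return {"tool": "web_search",
                "args": {"query": "1998 soccer World Cup winner"}}
    return {"answer": f"The 1998 World Cup was won by {observation}."}

def run(prompt):
    step = fake_model(prompt)
    while "tool" in step:                       # model asked for a tool call
        result = TOOLS[step["tool"]](**step["args"])
        step = fake_model(prompt, observation=result)
    return step["answer"]

print(run("Who won the soccer World Cup in 1998?"))
```

The point is only the shape of the loop: generate, detect a tool request, execute, feed the observation back, repeat until the model produces a final answer instead of another tool call.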
So uh a few quick things. First of all, thank you for defining what you mean by tool use. I think that's a great thing to do in general for the concepts we're talking about. Even things as well established as M_O_E_s, uh you have to say that means mixture of experts. You kind of have to build up an intuition for people: what that means, how it's actually utilized, what are the different flavors. So what does it mean that there's such an explosion of open models? What's your intuition?
With an open model, you want people to use it as the first and foremost thing, and then after that come things like transparency and trust. I think when you look at China, the biggest reason is that they want people around the world to use these models. If you look outside of the U_S_, a lot of people will not pay for software, but they might have computing resources where you can put a model on it and run it. There can also be data that you don't want to send to the cloud. So the number one thing is getting people to use models, use A_I_, or use your A_I_, which they might not be able to do
without having access to the model. I guess we should state explicitly: we've been talking about these Chinese models, and open weight models are oftentimes run locally. So it's not like you're sending your data to China, or to Silicon Valley, or whoever developed the model.
A lot of American startups make money by hosting these models from China and selling them, selling tokens, which means somebody will call the model to do some piece of work. I think the other reason, for U_S_ companies like Open A_I_, is that they're G_P_U_-deprived; they're at the limits of their G_P_U_s. Whenever they make a release they're always talking about, oh, our G_P_U_s are hurting. And I think in one of these G_P_T_ O_S_S_ release sessions Sam Altman said, like, oh, we're releasing this because you can use your
G_P_U_s. We don't have to use our G_P_U_s, and Open A_I_ could still get distribution out of this, which is another very real thing 'cause it doesn't cost them anything.
And for the user, I think, I mean, there are users who just use the model locally how they would use uh G_P_T_. But also for companies, uh I think it's a huge unlock to have these models because you can customise them, you can train them, you can add post-training, add more data, specialise them into, let's say, law or medical models, whatever you have. And, you mentioned Llama, the appeal of the open weight models from China is that the licenses are even friendlier. I think they are just unrestricted open source,
whereas if we use something like uh Llama or Gemma, there are some strings attached. I think it's like an upper limit in terms of how many users you have, and then if you exceed, I don't know, so many million users, you have to report your financial situation to, let's say, Meta or something like that. And I think, well, it is a free model, but there are strings attached, and people do like things where strings are not attached. So I think that's also one of the reasons, uh besides performance, why the open weight models from China are so popular: you can just use them
and there's no catch in that sense, yeah.
The ecosystem has gotten better on that front, but mostly downstream of these new providers providing such open licenses. That was funny when you pulled up Perplexity and it said Kimi K_ two Thinking, hosted in the U_S_, which is just, I've never seen this, but it's an exact example of what we're talking about, where people are sensitive to this. So like Kimi K_ two Thinking and Kimi K_ two is a model that is very popular; people say that it has very good, like, creative writing, and is also good at doing some software things. So it's just these little quirks that people pick up on with different models that they like.
Uh what are some interesting ideas that some of these models have explored that you can speak to, that are particularly interesting to you?
Maybe we can go chronologically. I mean, there was of course um DeepSeek R_ one that came out in January, if we just focus on two thousand twenty five. However, this was based on DeepSeek version three, which came out the year um before, in December two thousand twenty four. Uh there are multiple things on the architecture side. What is fascinating is, you can still, I mean, that's what I do in my from-scratch coding projects, you can still start with G_P_T_ two and add things to that model to make it into this other model. So it's all still kind of like the same lineage; there is a very
direct uh relationship between those. But uh, off the top of my head, DeepSeek, what was uh unique there is the mixture of experts, uh not, I mean, they were not inventing mixture of experts; we can maybe talk a bit more about what mixture of experts means. Um but just to list these things first before we dive into detail: mixture of experts, but then they also had multi-head latent attention, which is a tweak to the attention mechanism. This was, I would say, in two thousand twenty five the main distinguishing factor between these uh open weight
models: everyone had different tweaks to make inference cheaper or the K_V_ cache size smaller. We can also define the K_V_ cache in a few moments. But to make it more economical to have long context, you shrink the K_V_ cache size. So what are the tweaks um that we can do? Most of them focused on the attention mechanism. There is multi-head latent attention in DeepSeek. There is uh grouped query attention, which is still very popular; it's not invented by any of those models, it goes back a few years, but that would be the other option. Sliding window attention, I think
Gemma three uses it, um if I remember correctly. So there are these different tweaks that make the models different. Otherwise, um I put them all together in articles where um I just compare them; they are surprisingly similar. It's just different numbers in terms of how many repetitions of the transformer block you have in the centre, uh and like mm just little knobs that people tune. But what's so nice about it is it works no matter what. You can tweak things, you can move the normalisation layers around,
and get some performance gains, and there are almost never very good evaluation studies showing what it actually does to the model if you move something around, whether it makes it better or worse. But there are so many, let's say, ways you can implement a transformer and make it still work. Big ideas um that are still prevalent: mixture of experts, uh multi-head latent attention, um sliding window attention, grouped query attention. And then at the end of the year we saw a focus on making the attention mechanism scale linearly with the number of inference tokens.
So there was Qwen three Next, for example, which added a gated DeltaNet. It's um kind of like inspired by um state space models, where you have a fixed state that you keep updating, but it essentially makes this attention cheaper, or replaces attention with a cheaper operation.
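A rough sketch of that fixed-state idea, in NumPy. This is plain (ungated) linear attention, not the actual gated DeltaNet from Qwen three Next: it only illustrates why the per-token cost stays constant instead of growing with context length, as it does with a K_V_ cache.

```python
import numpy as np

# Simplified (ungated) linear attention: instead of attending over a
# growing K_V_ cache, keep a fixed-size state S = sum_t outer(k_t, v_t)
# and update it once per token. The gating and delta-rule refinements
# of the real layer are omitted; head size and inputs are made up.

d = 4                                  # head dimension
rng = np.random.default_rng(0)
S = np.zeros((d, d))                   # fixed-size recurrent state

for t in range(10):                    # stream of tokens
    k, v = rng.normal(size=d), rng.normal(size=d)
    S += np.outer(k, v)                # state update: O(d^2), independent of t
    q = rng.normal(size=d)
    out = q @ S                        # read-out: also O(d^2) per token

print(S.shape)                         # state never grows, unlike a K_V_ cache
```

With softmax attention the tenth token would attend over ten cached key/value pairs; here every token touches the same 4-by-4 state, which is the "scale linearly with inference tokens" property just mentioned.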
And maybe is it useful to step back and talk about the transformer architecture in general?
Yeah, so maybe we should start with the G_P_T_ two architecture, the transformer that was derived from the Attention Is All You Need paper. Uh so the Attention Is All You Need paper had a transformer architecture with two parts, an encoder and a decoder, and G_P_T_ went just focusing in on the decoder part. It is essentially still a neural network, um and it has this attention mechanism inside. And you predict one token at a time; you pass it through an embedding layer, there's the
transformer block. The transformer block has attention modules and a fully connected layer, and there are some normalization layers in between. But it's essentially neural network layers with this attention mechanism. So coming from G_P_T_ two, uh when we move on to G_P_T_ O_S_S_, there is for example the mixture of experts um layer. It's not invented by G_P_T_ O_S_S_; it's a few years old. Um but it is essentially a tweak to make the model larger without consuming more compute in each forward pass.
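The decoder block just described can be sketched in a few lines of NumPy: causal self-attention followed by a fully connected layer, with residual connections. Multiple heads and the normalisation layers are omitted, and all sizes and weights are made up for illustration, so this is the shape of the computation, not any particular model's block.

```python
import numpy as np

# A stripped-down G_P_T_-style decoder block: causal self-attention,
# then a feed-forward (fully connected) layer, each with a residual.

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_block(x, Wq, Wk, Wv, W1, W2):
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -1e9     # causal mask: no peeking ahead
    x = x + softmax(scores) @ v                # attention + residual
    x = x + np.maximum(x @ W1, 0) @ W2         # feed-forward (ReLU) + residual
    return x

rng = np.random.default_rng(0)
d, T = 8, 5                                    # toy embedding dim, 5 tokens
params = [rng.normal(size=s) * 0.1
          for s in [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]]
out = decoder_block(rng.normal(size=(T, d)), *params)
print(out.shape)                               # one output vector per token
```

Stacking this block many times, plus the embedding layer at the bottom and an output projection at the top, is essentially the whole G_P_T_ two architecture.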
There is this uh fully connected layer, and if listeners are familiar with um multi-layer perceptrons, you can think of a mini multi-layer perceptron, a fully connected neural network layer, inside the transformer. And it's very expensive because it's fully connected: if you have a thousand inputs and a thousand outputs, that's like one million connections, and it's a very expensive part of this transformer. And the idea is to kind of expand that into multiple feed-forward networks. So instead of having one, let's say you have two hundred fifty-six.
But that would make it way more expensive, because now we have two hundred fifty-six of them. But you don't use all of them at the same time. So you now have a router that says, okay, based on this input token, it would be useful to use this um fully connected network. And in that context it's called an expert. So a mixture of experts means we have multiple experts, and depending on what your input is, uh let's say it's more math heavy, it would use different experts compared to, let's say, translating input text from English to Spanish; that would maybe consult different experts.
It's not quite as clear cut as saying, okay, this is only an expert for math and this one for Spanish; it's a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, but not all the knowledge is used all the time; that would be very wasteful. So during token generation you are more selective: there's a router that selects which tokens should go to which expert. It's more complexity, it's harder to train, there's a lot that can go wrong, like router collapse and everything. So I think
that's why OLMo three still uses uh dense. I mean, there are OLMo models with mixture of experts, but dense models, uh where dense means, uh so also it's jargon: there's a distinction between dense and sparse. Mixture of experts is considered sparse because we have a lot of experts but only a few of them are active; that's called sparse. And dense would be the opposite, where you only have one fully connected module and it's always, you know, utilised.
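A toy version of the sparse routing just described, in NumPy: a router scores every expert per token, and only the top-k experts actually run. The sizes, the number of experts, and the softmax-over-chosen-experts weighting are illustrative choices, not any particular model's recipe.

```python
import numpy as np

# Toy mixture-of-experts layer. Each "expert" is a small feed-forward
# network (W1, W2); a router picks top_k experts per token, so most
# experts stay idle for any given token -- that's the "sparse" part.

def moe_layer(x, router_W, experts, top_k=2):
    outputs = np.zeros_like(x)
    for i, token in enumerate(x):
        logits = token @ router_W                 # one score per expert
        chosen = np.argsort(logits)[-top_k:]      # top-k experts for this token
        weights = np.exp(logits[chosen])
        weights /= weights.sum()                  # softmax over chosen experts
        for w, e in zip(weights, chosen):
            W1, W2 = experts[e]
            outputs[i] += w * (np.maximum(token @ W1, 0) @ W2)
    return outputs

rng = np.random.default_rng(0)
d, n_experts, T = 8, 4, 3                         # made-up sizes
router_W = rng.normal(size=(d, n_experts))
experts = [(rng.normal(size=(d, 2 * d)), rng.normal(size=(2 * d, d)))
           for _ in range(n_experts)]
out = moe_layer(rng.normal(size=(T, d)), router_W, experts)
print(out.shape)
```

With 4 experts and top_k equal to 2, each token pays for only half the expert parameters per forward pass, which is exactly the "larger model without more compute per pass" trade described above.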
So maybe it's a good place to also talk about the K_V_ cache. But actually, before that, even zooming out: fundamentally, how many new ideas have been implemented from G_P_T_ two to today? Like, how different really are these architectures?
Mm-hmm.
Replacing layer norm by R_M_S_ norm, but it's just a different normalization layer and not a big change; it's just a tweak. Um the non-linear activation function, um for people familiar with deep neural networks, I mean, it's the same as changing sigmoid to ReLU; it's not changing the network uh fundamentally, it's just a little tweak. Um and that's about it, I would say. It's not really fundamentally that different; it's still the same architecture, so you can go from one into the other by just adding these
changes basically. Mm-hmm. Yep. So for example, you mentioned my book earlier; that's a G_P_T_ two model in the book, because it's simple and it's very small, um one hundred twenty-four million parameters approximately. But in the bonus materials I do have OLMo three from scratch, Gemma three from scratch, and other types of from-scratch models. And I always start with my G_P_T_ two model and just, you know, edit the different components, and you get from one to the other. It's kind of like a lineage, in a sense, yeah.
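The LayerNorm-to-R_M_S_-norm swap mentioned a moment ago is small enough to show directly. A sketch in NumPy, with the learnable scale and shift parameters that real models have omitted for brevity:

```python
import numpy as np

# LayerNorm centres the activations (subtracts the mean) and rescales
# by the standard deviation; R_M_S_ norm skips the centring and divides
# by the root mean square only -- one less reduction per call.

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms                      # no mean subtraction

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x).round(2))           # zero-mean output
print(rms_norm(x).round(2))             # same signs as input, rescaled
```

Same inputs, same shapes, slightly different statistics: exactly the kind of "little tweak" being described, which is why swapping one for the other doesn't change the architecture in any fundamental way.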
But give intuition for people, because uh when you zoom out and look at it, there's so much rapid advancement in the A_I_ world, and at the same time fundamentally the architectures have not changed. So where is all the turbulence, the turmoil of the advancement happening? Where are the gains to be had?
You have the pre-training. Now, um back then it was just pre-training with G_P_T_ two. Now you have pre-training, mid-training and post-training. Um so I think right now we are in the post-training focus stage. I mean, pre-training still gives you um advantages if you scale it up with better, higher quality data. But then we have capability unlocks that were not there with G_P_T_ two. For example, uh Chat G_P_T_: it is basically a G_P_T_ three model, and G_P_T_ three is the same as G_P_T_ two in terms of
architecture. What was new was adding the um supervised fine-tuning and the reinforcement learning from human feedback. So it's more on the algorithmic side rather than the architecture.
I would say that the systems also change a lot. I think if you listen to NVIDIA's announcements, they talk about these things: you can now do F_P_ eight, you can now do F_P_ four. And what is happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster, and that lets them put more data in, and then you can find better configurations faster by doing this.
You can look at, like, essentially the tokens per second per G_P_U_; that's a metric that you look at when you're doing large scale training. And you can go from like ten K_ to thirteen K_ by turning on F_P_ eight training, which means you're using less memory per parameter in the model. And by saving less information you do less communication, so you can train faster. So all of these system things underpin way faster experimentation on data and algorithms, and it's this kind of thing
where it's kind of hard to describe: when you look at the architectures, they're exactly the same, but the code base used to train these models is gonna be vastly different. And, I mean, the G_P_U_s are different, but you could probably train G_P_T_ O_S_S_ twenty B_ way faster in wall clock time than G_P_T_ two was trained at the time.
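The memory side of that F_P_ eight point can be shown with back-of-the-envelope arithmetic; the model size below is made up, and only the bytes-per-value math is real:

```python
# Bytes per parameter: B_F_ sixteen is 16 bits (2 bytes), F_P_ eight
# is 8 bits (1 byte). The 20B parameter count is a hypothetical
# example, not any specific model's exact size.

params = 20e9                     # hypothetical 20B-parameter model
bytes_bf16, bytes_fp8 = 2, 1

print(f"BF16 weights: {params * bytes_bf16 / 1e9:.0f} GB")   # 40 GB
print(f"FP8  weights: {params * bytes_fp8 / 1e9:.0f} GB")    # 20 GB
```

Halving the bytes per value roughly halves both the memory footprint and the data moved between G_P_U_s, which is where throughput jumps like the ten K_ to thirteen K_ tokens per second per G_P_U_ mentioned above come from.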
For the speed, this is true, but uh it doesn't give the model new capabilities in a sense. It's just: how much can we make the computation coarser without suffering model performance degradation? Um but I do think, I mean, there are alternatives popping up to the transformer. There's text diffusion models, a completely different paradigm, um and, though text diffusion models might use transformer architectures, it's not an autoregressive um transformer. And also Mamba models; uh it's a
state space model. But they do have trade-offs, and what's right is that nothing has replaced the um autoregressive transformer as the state of the art model. So for state of the art, you would still go with that thing. But there are now alternatives for the cheaper end, like alternatives that are kind of um making compromises. But it's not just one architecture anymore; there are little ones coming up. But if we talk about the state of the art, it's pretty much still the autoregressive transformer architecture.
Derived from G_P_T_ two essentially.
I guess the big question here is: we talked quite a bit here about the architecture behind the pre-training. Are the scaling laws holding strong across pre-training, post-training, inference, context size, data, synthetic data?
I like to start with the technical definition of a scaling law, which kind of informs all of this. The scaling law is a power law relationship where you could think of the X_ axis, so kind of what you are scaling, as a combination of compute and data, which are kind of similar, and then the Y_ axis is like the held-out prediction accuracy over the next token. So we talk about models being autoregressive; it's like, if you keep a set of text that the model has not seen, how accurate does it get as you train. And the idea of scaling laws came when
people figured out that that was a very predictable relationship, and I think that that technical trend is continuing. And then the question is, what do users get out of it? And then there are more types of scaling, where um Open A_I_'s O_ one was famous for introducing inference-time scaling, and I think less famously for also showing that you can scale reinforcement learning training and get kind of this log X_ axis and then a linear increase in performance on the Y_ axis. So there's kind of these three axes now, where the traditional
scaling laws are talked about for pre-training, which is how big your model is and how big your data set is. And then scaling reinforcement learning, which is like how long you can do this trial-and-error learning, which we will talk about and define more. And then this inference-time compute, which is just letting the model generate more tokens on a specific problem. So I'm kind of bullish: they're all really still working, but the low hanging fruit has mostly been taken, especially in the last year, on um reinforcement learning with verifiable rewards, which is this R_L_V_R_, and then inference-time scaling,
Which is why these models feel so different to use, where previously you would get that first token immediately, and now they'll go off for seconds, minutes, or even hours generating these hidden thoughts before giving you the first word of your answer. That's all about this inference-time scaling, which is such a wonderful kind of step function in how the models' abilities changed. It kind of enabled this tool-use stuff and enabled this much better software engineering that we were talking about. And when we say enabled, it's almost entirely downstream of the fact that this reinforcement learning with verifiable rewards training just let the models pick up these skills very easily. So let the models learn. If you look at the reasoning process when the models are generating a lot of tokens, what it'll often be doing is: it tries a tool, it looks at what it gets back, it tries another A_P_I_, it sees what it gets back and whether it solves the problem. So the models, when you're training them, very quickly learn to do this, and at the end of the day that gives this general foundation where the model can use C_L_I_ commands very nicely in your repo and handle git for you and move things around and organise things, or search to find more information, which, if we were sitting in these chairs a year ago, is something we didn't really imagine the models doing. So this is just something that has happened this year and has totally transformed how we think of using A_I_, which I think is very magical. It's such an interesting evolution and just unlocks so much value. But it's not clear what the next avenue will be in terms of unlocking stuff like this. We'll get to continual learning later, but there's a lot of buzz around certain areas of A_I_, and no one knows when the next step function will really come.
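As a concrete illustration of the power-law definition above, here's a toy sketch in Python. All constants and data points are invented for illustration, not taken from any real model; the point is just that a power law is a straight line in log-log space, which is what makes scaling so predictable:

```python
import math

# Hypothetical (compute, held-out loss) pairs that follow a power law
# L(C) = a * C^(-b); the constants here are made up for illustration.
a_true, b_true = 10.0, 0.05
points = [(10.0**e, a_true * (10.0**e) ** (-b_true)) for e in range(18, 26)]

# A power law is a straight line in log-log space:
# log L = log a - b * log C, so ordinary least squares recovers a and b.
xs = [math.log(c) for c, _ in points]
ys = [math.log(l) for _, l in points]
n = len(points)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
b_fit = -slope
a_fit = math.exp(y_mean - slope * x_mean)

# Extrapolate: predicted loss if we scale compute 10x past the data.
predicted = a_fit * (10.0**26) ** (-b_fit)
print(b_fit, predicted)
```

Because the fitted line extrapolates, labs can predict the loss of a run they haven't done yet, which is exactly the "very predictable relationship" described above.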
So you've actually said quite a lot of things there, and said profound things quickly. It would be nice to unpack them a little bit. You say you're bullish basically on every version of scaling. So can we just start at the beginning: pre-training. Are we implying that the low-hanging fruit on pre-training scaling has been picked? Has pre-training hit a plateau, or are you still bullish even on pre-training?
Pre-training has gotten extremely expensive. To scale up pre-training also implies that you're gonna serve a very large model to users. I think it's been loosely established that the likes of G_P_T_ four and similar models were around one trillion parameters, this order of a trillion parameters at the biggest size. There are a lot of rumors that they've actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your costs of serving go down proportionately. For these models, the cost of training them is really low relative to the cost of serving them to hundreds of millions of users. DeepSeek had this famous number of about five million dollars for pre-training at cloud market rates. I think for Olmo three, section two point four in the paper details how long we had the G_P_U_ clusters sitting around for training, which includes engineering issues and multiple seeds, and it was about two million dollars to rent the cluster and deal with all the problems and headaches of training a model. So these models are pretty accessible; a lot of people could get a couple million dollars to train a model. But the recurring cost of serving millions of users is really billions of dollars of compute. You can look at what a thousand-G_P_U_ rental costs, you can pay a hundred grand a day for it, and these companies could have millions of G_P_Us. You can look at how much these things cost just to sit around. So that's kind of a big thing, and then it's like, if scaling is actually giving you a better model, is it gonna be financially worth it? I think it'll slowly push outward as A_I_ solves more compelling tasks, like the likes of Claude Opus four point five making Claude Code just work for things. I launched this project called the A_T_O_M_ Project, which is about American truly open models, in July. And that was a true vibe-coded website, and I had it make plots and stuff. Then I came back to refresh it in the last few weeks, and Claude Opus four point five, versus whatever the model at the time was, just crushed all the issues that it had from building in June and July. It might be a bigger model; there are other things that go into this, but there's still progress coming.
So what you're speaking to is the nuance of the Y_ axis of the scaling laws, that the way it's experienced versus on a benchmark, the actual intelligence, might be different. But still, your intuition about pre-training: if you scale the size of compute, will the models get better? Not whether it's financially viable, but just from the law aspect of it, do you think the models will get smarter?
Yeah. And this sometimes comes off as almost delusional when leadership at A_I_ companies says it, but it's held for thirteen orders of magnitude of compute, so why would it ever end? I think fundamentally it is pretty unlikely to stop. It's just that eventually we're not even gonna be able to test the bigger scales because of all the problems that come with more compute. There's a lot of talk about how twenty twenty-six is the year when very large Blackwell compute, gigawatt-scale facilities from hyperscalers, comes online. These were all contracts for power and data centres that were signed and sought out in twenty twenty-two and twenty twenty-three, so before or right after Chat G_P_T_. It took this two-to-three-year lead time to build these bigger clusters to train the models. There's obviously immense interest in building even more data centres than that. So that is the crux of what people are saying: these new clusters are coming, the labs are gonna have more compute for training, and they're going to utilise it. But it's not a given. I've seen so much progress that I expect it, and I expect a little bit bigger models. I would say it's more like we will see a two-thousand-dollar subscription this year; we already see two-hundred-dollar subscriptions, and that can ten-X again. These are the kinds of things that could come, and they're all downstream of this slightly bigger model that offers just a little more cutting edge.
So it's reported that X_A_I_ is gonna hit that one-gigawatt scale in early twenty twenty-six and a full two gigawatts by year end. How do you think they'll utilise that in the context of scaling laws? Is a lot of that inference? Is a lot of that training?
It ends up being all of the above. All of your decisions when you're training a model come back to pre-training. If you're gonna scale R_L_ in a model, you still need to decide on an architecture that enables it. We were talking about other architectures and using different types of attention. We're also talking about mixture-of-experts models; the sparse nature of M_O_E_ models makes it much more efficient to do generation, which becomes a big part of post-training. So you need to have your architecture ready so that you can actually scale up this compute. I still think most of the compute is going in at pre-training, because you can still make a model better, you still want to go and revisit this, and you still want the best base model that you can get. In a few years that'll saturate, and the R_L_ compute will just go longer.
There are people who disagree with you, who say basically pre-training is dead, it's all about scaling inference, scaling post-training, scaling context, continual learning, scaling data, synthetic data.
People vibe that way and describe it in that way, but I think it's not the practice that is happening.
this thing's dead.
Yeah.
Yes.
Mm-hmm. So reasoning is
If you get a new compute cluster, it lets you do something maybe more stably or faster, 'cause you hear a lot about Blackwell having roll-out issues. At A_I_ two, most of the models were pre-trained on around one to two thousand G_P_Us, but when you're pre-training on ten thousand or a hundred thousand G_P_Us you hit very different failures. G_P_Us are known to break in weird ways, and doing a hundred-thousand-G_P_U_ run, you're pretty much guaranteed to always have at least one G_P_U_ that is down, and you need to have your training code handle that redundancy. That's a very different problem from, say, me playing with post-training on a D_G_X_ Spark, or your book for people learning M_L_. What they're battling to train these biggest models is massive distributed scale. That's a systems problem: in order to enable the scaling laws, especially of pre-training, you need all these G_P_Us at once. When we shift to reinforcement learning, it actually lends itself to heterogeneous compute, because you have
many copies of the model. To do a primer on reinforcement learning for language models: you have two sets of G_P_Us. One you can call the actor, and one you call the learner. The learner is where your actual reinforcement learning updates are gonna happen. These are traditionally policy-gradient algorithms; proximal policy optimisation, P_P_O_, and group relative policy optimisation, G_R_P_O_, are the two popular classes. On the other side you're gonna have actors, which are generating completions, and these completions are the things that you're gonna grade. Reinforcement learning is all about optimising reward. In practice, you can have a lot of different actors in different parts of the world doing different types of problems, and then you send it back to this highly networked compute cluster to do the actual learning, where you take the gradients, and you need a tightly meshed network where you can do different types of parallelism and spread out your model for efficient training.
Every different type of training and serving has these considerations as you scale. We talked about pre-training, we talked about R_L_, and then inference-time scaling is: how do you serve a model that's thinking for an hour to a hundred million users? I don't really know about that, but I know it's a hard problem, and in order to give people this intelligence there are all these systems problems; we need more compute, and you need more stable compute to do it.
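A minimal sketch of the group-relative idea behind G_R_P_O_, mentioned in the actor/learner description above. This is a deliberate simplification: real implementations also handle the policy-gradient update itself, clipping, and K_L_ penalties; this only shows how each completion's reward is normalised against its group, which is what removes the need for a learned value function:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each completion's reward is
    normalised against the other completions for the same prompt, so no
    learned value function (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled completions graded by a verifiable reward
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
```

The actors generate the completions and rewards; the learner turns these advantages into gradient updates. Note that if every completion in the group gets the same reward, all advantages are zero and the prompt contributes no learning signal.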
But you're bullish on all of these kinds of scaling is what I'm hearing. On the inference, on the reasoning, even on the pre-training.
Yeah, so that's a big can of worms. There are basically two knobs where you can get gains: training and inference scaling. In a world where we had, let's say, infinite compute resources, you'd wanna do all of them. So you have training and you have inference scaling, and training is a hierarchy: pre-training, mid-training, post-training. Changing the model size, more training data, training a bigger model gives you more knowledge in the model; the model is, let's say, a better base model, or, back in the day and still, we call it a foundation model. But you don't, let's say, have the model be able to solve your most complex tasks during or right after pre-training. You still have these other unlock phases, where you have mid-training, for long context for example, or post-training with R_L_V_R_, that unlock capabilities the model has in terms of knowledge from the pre-training. And sure, if you do more pre-training, you get a better base model that you can unlock later, but like Nathan said, it just becomes too expensive. We don't have infinite compute, so you have to decide: do I want to spend that compute on making the model larger? It's a trade-off. In an ideal world you want to do all of them, and in that sense, scaling is still pretty much alive; you would still get a better model. But like we saw with G_P_T_ four point five, it's just not worth it, because you can, let's say, unlock more performance with other techniques at that moment. Especially if you look at inference scaling, that's one of the biggest gains this year,
where it took a smaller model further than pre-training a larger model like G_P_T_ four point five did. So I wouldn't say pre-training scaling is dead; there are just other, more attractive ways to scale right now, at the moment. At some point you will still wanna make progress on the pre-training. The thing to consider is also where and why you wanna spend your money. If you spend it more on the pre-training, it's a fixed cost: you train the model, and then it has this capability forever; you can always serve another user and so forth. With inference scaling, you don't spend money during training, you spend money later, per query, and then it's also the math of how long your model is gonna be on the market. If I replace it in half a year, maybe it's not worth spending five million, ten million, a hundred million dollars on training it longer. Maybe I will just do more inference scaling and get the performance from there; maybe it costs me two million in terms of user queries. It becomes a question of how many users you have, and then doing the math. And I think that's also where it's interesting: Open A_I_ is in a position, I think, where they have a lot of users and they need to go a bit cheaper, where they have that G_P_T_ five model that is a bit smaller. Other companies' customers have other trade-offs. For example, there was also the math Olympiad, or some of these math problems, where Open A_I_ maybe had a proprietary model, and I'm pretty sure it's just a model that has maybe been fine-tuned a little bit more, but most of it was doing inference scaling to achieve peak performance in certain tasks, where you don't need that all the time. But yeah, long story short, I do think all of these, pre-training, mid-training, post-training, inference scaling, are all still things you wanna do. It's just that at the moment, in this year, it's about finding the right ratio that gives you the best bang for the buck, basically.
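The fixed-cost-versus-per-query trade-off just described can be sketched as a break-even calculation. Every dollar figure below is invented purely for illustration; the structure of the comparison is the point, not the numbers:

```python
# Toy version of the "fixed training cost vs per-query cost" math.
# All dollar figures are invented for illustration.
extra_pretrain_cost = 50e6   # extra one-off spend to train a bigger model
per_query_big = 0.002        # bigger model answers directly, fewer tokens
per_query_small = 0.005      # smaller model "thinks" longer on each query

def total_cost(fixed, per_query, n_queries):
    # lifetime cost of an option: one-off training plus serving
    return fixed + per_query * n_queries

# Past this many queries, paying for the bigger model up front wins.
break_even = extra_pretrain_cost / (per_query_small - per_query_big)
print(f"{break_even:,.0f} queries")
```

With these made-up numbers the bigger model only pays off after tens of billions of queries, which is why the answer depends so heavily on how many users you have and how long the model stays on the market.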
I think this might be a good place to define pre-training, mid-training, and post-training.
So pre-training is the classic training of next-token prediction, one token at a time. You have a big corpus of data, and Nathan also has very interesting insights there because of Olmo three; a big portion of the paper focuses on the right data mix. So pre-training is essentially just training a cross-entropy loss on next-token prediction over a vast corpus of internet data, books, papers, and so forth. It has changed a little bit over the years, in the sense that people used to throw in everything they could. Now it's not just raw data, it's also synthetic data, where people, let's say, rephrase certain things. So synthetic data doesn't necessarily mean purely A_I_-made-up data; it's also taking something from an article, a Wikipedia article, and then rephrasing it as a Q_ and A_ question, or summarising it, rewriting it, and making better data that way. Because I think of it also like with humans: if someone, let's say, reads a book compared to, I dunno, Reddit posts or something like that, no offence, but I think
There's gonna be a post about this.
Reddit data is very coveted and excellent for training; you just have to filter it. I think that's the idea. But if someone took that and rephrased it in a, let's say, more concise and structured way, I think it's higher-quality data. You maybe get the same L_L_M_ out of it at the end, but it gets there faster; it trains faster, because if the grammar and the punctuation are correct, it already learns the correct way, versus getting information in a messy way and then learning later how to correct that. So that is how pre-training evolved, and why scaling still works: it's not just about the amount of data, it's also the tricks to make that data better for you, in a sense. And then mid-training: I think it's called mid-training because it was awkward to have pre-training and post-training but nothing in the middle. It sounds a bit weird, you have pre-training and post-training, but what's
the actual training? So mid-training is usually similar to pre-training, but it's a bit more, I would say, specialized than pre-training. It's the same algorithm, but you focus, for example, on long context; that's one example, you have long-context documents. The reason you don't do that during pure pre-training is that you don't have that many long-context documents, so you have a specific phase for it. And one problem of L_L_M_s is that it's still a neural network; it has the problem of catastrophic forgetting. You teach it something, it forgets other things. It's not a hundred percent forgetting, but there's no free lunch. It's the same with humans: if you ask me some math I learned ten years ago, I don't know, I would have to look at it again.
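To make the next-token cross-entropy objective from the pre-training definition above concrete, here's a minimal toy sketch. The tiny vocabulary and the "logits" table are invented; a real model produces logits with a transformer rather than a lookup, but the loss at each position is computed exactly this way:

```python
import math

# Minimal next-token prediction objective, the core of pre-training.
# A toy bigram "model": logits come from a hand-made table instead of a
# transformer, but the cross-entropy loss is computed the same way.
vocab = {"the": 0, "cat": 1, "sat": 2}
text = ["the", "cat", "sat"]
logits_table = {
    0: [0.1, 2.0, 0.3],  # after "the", the model favours "cat"
    1: [0.2, 0.1, 1.5],  # after "cat", the model favours "sat"
}

def cross_entropy(logits, target):
    # standard softmax cross-entropy for one position (numerically stable)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

loss = 0.0
for i in range(len(text) - 1):
    ctx, nxt = vocab[text[i]], vocab[text[i + 1]]
    loss += cross_entropy(logits_table[ctx], nxt)
loss /= len(text) - 1
print(round(loss, 3))
```

Pre-training is then just minimising this averaged loss over trillions of tokens, and the held-out version of the same quantity is the Y_ axis of the scaling laws discussed earlier.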
Uh Nathan was actually saying that he's consuming so much content that there is a catastrophic forgetting issue.
Yeah, I'm trying to learn so much about A_I_, it's like, I was learning about pre-training parallelism and I'm like, I lost something and I don't know what it was.
Mm-hmm. Mm-hmm.
pre-training, but I mean, I don't think anyone does that in production.
Toy examples for now, huh? But to generalise: post-training is more like the skill unlock, where pre-training is like soaking up the knowledge, essentially.
A few things that could be helpful for people: a lot of people think of synthetic data as being bad for training the models. You mentioned the DeepSeek O_C_R_, optical character recognition, paper. A lot of labs did this; A_I_ two had one, actually multiple. The reason each of these labs has these is that there are vast amounts of P_D_F_s and other digital documents on the web in formats where the text isn't easily extracted. So you use these O_C_R_ models, ours was called olmOCR, to extract what can be trillions of tokens of candidate data for pre-training. And pre-training data set size is measured in trillions of tokens: smaller models from researchers can be something like five to ten trillion, Qwen is documented going up to like fifty trillion, and there are rumors that the closed labs can go to like a hundred trillion tokens. Just getting this potential data to put in, they have a very big funnel, and then the data you actually train the model on is a small percentage of this. This character-recognition data would be described as synthetic data for pre-training in a lab. And then there are things like: Chat G_P_T_ now gives wonderful answers, and you can train on those best answers, and that's synthetic data too. It's very different from early Chat G_P_T_, with lots of hallucinations, which is when people's worries about synthetic data became grounded.
One interesting question: if I recall correctly, Olmo three was trained with less data than some other open-weight models, maybe even less than Olmo two, but you still have better performance, and that might be one example of how the data helps.
It's mostly down to data quality. If we had more compute we would train for longer; I think we'd ultimately see that as something we would want to do. And especially with big models you need more compute, because we talked about having more parameters and we talked about knowledge, and essentially there's a ratio where big models can absorb more from data, so you get more benefit out of this. Picture any logarithmic graph in your mind: a small model will level off sooner if you're measuring trillions of tokens, and bigger models need more. But the reality is we aren't training that big of models right now at A_I_ two, so getting the highest-quality data we can is the natural starting point.
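The parameter-to-data ratio Nathan mentions has a well-known back-of-the-envelope version from the scaling-law literature: roughly twenty training tokens per parameter for compute-optimal training, with training cost estimated at about six FLOPs per parameter per token. Both constants are approximations, and real labs deviate from them deliberately (for example, overtraining small models so they're cheap to serve):

```python
# Rough compute-optimal arithmetic in the Chinchilla style: ~20 training
# tokens per parameter, and ~6 * N * D FLOPs for one training pass.
# Both constants are approximations from the scaling-law literature.
params = 7e9                 # a 7B-parameter model
tokens = 20 * params         # ~140B tokens to train compute-optimally
flops = 6 * params * tokens  # ~5.9e21 FLOPs for the whole run
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

This is why a bigger model "needs more": doubling the parameter count roughly doubles the token budget too, so compute grows with the product of the two.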
Is there something to be said about the topic of data quality? Is there some low-hanging fruit there still, where the quality could be improved?
It's like turning the crank. Historically, in the open there's been a canonical best pre-training data set that has moved around based on who has the most recent or best effort. A_I_ two's Dolma was very early, with the first Olmo; Hugging Face had FineWeb; and there's the D_C_L_M_ project, which stands for DataComp language model; there's been DataComp for other machine learning projects, and they had a very strong data set. A lot of it is that the internet is becoming fairly closed off. We have Common Crawl, which I think is hundreds of trillions of tokens, and you filter it, and it ends up being a lot of scientific work where you're training classifiers and making decisions on how to prune this data set down into the highest-quality stuff and the stuff that suits your tasks. Previously language models were tested a lot more on knowledge and conversational things, but now they're expected to do math and code. So to train a reasoning model you need to remix your whole data set, and there are actually a lot of wonderful methods here, where you take your gigantic data set and sample a lot of really tiny subsets from different sources. Say you have GitHub, Stack Exchange, Reddit, Wikipedia: you can sample small things from them, train small models on each of these mixes, and measure their performance on your evaluations. Then you can do basic linear regression, and it's like, here's your optimal data set. But if your evaluations change, your data set changes a lot. So a lot of Olmo three was new sources for reasoning, to be better at math and code, and then you do this mixing procedure and it gives you the mix. A lot of that has happened at labs this year. There are new hot things, whether it's coding environments or web navigation, and you just need to bring in new data. You need to change your whole pre-training so that your post-training can work better, and things like this. That's the constant re-evolution and re-determining of what labs care about for their models.
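A toy sketch of that sample-small-mixes-then-regress procedure. The data sources, mixture weights, and eval scores below are all invented, and a real pipeline would train actual proxy models rather than read scores from a table; the point is only the shape of the method: fit a linear model from mixture weights to eval score, then score a mix you never trained:

```python
# Each trial: (weights over web/code/math data, observed proxy eval score).
trials = [
    ((0.8, 0.1, 0.1), 0.51),
    ((0.6, 0.3, 0.1), 0.58),
    ((0.6, 0.1, 0.3), 0.56),
    ((0.4, 0.3, 0.3), 0.63),
]

def fit_linear(trials, steps=20000, lr=0.1):
    # plain gradient descent on squared error; fine at this tiny scale
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for mix, score in trials:
            err = sum(wi * xi for wi, xi in zip(w, mix)) - score
            for j in range(3):
                grad[j] += 2 * err * mix[j]
        for j in range(3):
            w[j] -= lr * grad[j] / len(trials)
    return w

w = fit_linear(trials)         # per-source "value" estimates
candidate = (0.5, 0.25, 0.25)  # a mix we never actually trained
pred = sum(wi * xi for wi, xi in zip(w, candidate))
print([round(x, 2) for x in w], round(pred, 3))
```

The fitted weights act as a cheap per-source value estimate, so you can rank many candidate mixes without training a full model on each one, which is exactly why evaluation changes ripple back into the data mix.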
Are there fun anecdotes of what sources of data are particularly high quality that we wouldn't expect? You mentioned Reddit sometimes can be a source.
Reddit was very useful. And P_D_F_s are definitely one.
Or especially arXiv.
Yeah. A_I_ two has run Semantic Scholar for a long time, which is what you could call a competitor to Google Scholar with a lot more features. To do this, A_I_ two has found and scraped a lot of P_D_F_s of openly accessible papers that might not be behind the closed paywall of a certain publisher, so truly open scientific P_D_F_s. If you sit on all of these and process them, you can get value out of it. A lot of that style of work was done by the frontier labs much earlier, and you need a pretty skilled researcher who understands how things change models; they bring the data in and they clean it, and that's a lot of labour. I think at a lot of frontier labs, when they scale researchers, a lot of it goes into data. If you join a frontier lab and you wanna have impact, the best way to do it is just to find new data that's better. The fancy, glamorous algorithmic things, like figuring out how to make O_ one, are the sexiest thought of a researcher, and there's a group that did that, but I think most of the contributions are like, I'm gonna make the data better, or I'm gonna make the infrastructure better so that everybody on my team can run experiments five percent faster.
only licensed data, whereas Common Crawl is a scrape of the whole internet. I host multiple websites, and I'm happy to have them train language models, but I'm not explicitly licensing it, and what governs that? So Common Crawl is largely unlicensed, which means that your consent really hasn't been provided for how to use the data. There's another idea where you can train language models only on data that has been licensed explicitly, so that the governing contract is provided. I'm not sure if Apertus is the right name for the licensed one. I know that the reason they did it was for an E_U_ compliance thing, where they wanted to make sure that their model fit one of those checks.
Some said they just purchase the license. Let's say they buy a book online, an Amazon Kindle book or, say, a Manning book or something, and then use that in the training data. And that is the grey zone, because you paid for the content, and you might wanna train on it,
but then there are also restrictions where even that shouldn't be allowed, and that is where it gets a bit fuzzy. It is still a hot topic right now. Also, big companies like Open A_I_ have approached private companies for their proprietary data, and private companies have become more and more, let's say, protective of their data, because they know, okay, this is gonna be my moat in a few years. And I do think that's the interesting question: if L_L_M_s become more commoditized, and a lot of people learn about L_L_M_s, there will be a lot more people able to train L_L_M_s. Of course there are infrastructure challenges, but if you think of big industries, like the pharmaceutical industry, law, the finance industry, I do think they will at some point hire people from frontier labs to build their in-house models on their proprietary data, which will then again be another unlock with pre-training that is currently not there, because even if you wanted to, you can't get that data; you can't get access to clinical trials most of the time, and these types of things. So I do think scaling in that sense might still be pretty much alive if you also look at domain-specific applications, because right now, this year, we are just looking at general-purpose L_L_M_s, Chat G_P_T_, Anthropic, and so forth. They are general purpose. They're not even scratching the surface, I think, of what an L_L_M_ can do if it is really specifically trained and designed for a specific task.
I think on the data thing, this is one of those things that happened in twenty twenty-five that we totally forget: Anthropic lost in court and owed one point five billion dollars to authors. Anthropic, I think, bought thousands of books and scanned them, and was cleared legally for that because they bought the books, and that is kind of going through the system. On the other side, they also torrented some books, and I think the torrenting was the path where the court said they were liable to pay these billions of dollars to authors, which is just such a mind-boggling lawsuit that kind of just came and went. That is so much money from the V_C_ ecosystem.
These are court cases that will define the future of human civilization, 'cause it's clear that data drives a lot of this, and there's this very complicated human tension. I mean, you can empathize, you're both authors. There's some degree to which, I mean, you put your heart and soul and your sweat and tears into the writing that you do, and it feels a little bit like theft for somebody to train on your writing without giving you credit.
There are two layers to it. Someone might buy the book and then train on it, which could be argued fair or not fair. But then there are the straight-up companies who use pirated books, where it's not even compensating the author. That is, I think, where people got a bit angry about it specifically.
There has to be some kind of compensation scheme. This is moving towards something like what Spotify streaming did originally for music. What does that compensation look like? You have to define those kinds of models, you have to think through all of that. One other thing I think people are generally curious about, and I'd love to get your thoughts: as L_L_M_s are used more and more, if you look at even arXiv, but also GitHub, more and more of the data is generated by L_L_M_s. What do you do in that kind of world? How big of a problem is that?
Well there are just problems in infrastructure and systems, but from an A_I_ point of view it's kind of inevitable.
So it's basically L_L_M_ generated data that's curated by humans essentially, right?
Yes, and I think a lot of open source contributors are legitimately burning out. If you have a popular open source repo, somebody's like, oh, I wanna do open source A_I_, it's good for my career, and they just vibe-code something and throw it into the P_Rs. You might get more of this than I do. So I have
a case study here: I have a repository called mlxtend that I developed as a student around ten years ago. It is still a reasonably popular library for certain algorithms, I think especially the frequent pattern mining stuff. And recently there were, I think, two or three people who submitted a lot of P_Rs in a very short amount of time. I do think L_L_M_s have been involved in submitting these P_Rs. For me as the maintainer, there are two things. First, I'm a bit overwhelmed; I don't have time to
go through it, because it's an older library that is not a priority for me. At the same time, I also kind of appreciate it, because something people forget is that it's not just using the L_L_M_; there's still a human layer that verifies something. And that is, in a sense, also how data is labelled, right? One of the most expensive things is getting labelled data for R_L_ from human feedback phases. And this is kind of like that: it goes through phases, and then you actually get higher-quality data out of it. So I don't mind it, in a sense. It can feel overwhelming, but I do think there is also value in that.
It feels like there's a fundamental difference between raw L_L_M_-generated data and L_L_M_-generated data with a human in the loop who does some kind of verification, even if that verification covers a small percent of the lines of code.
I think this goes with anything. People sometimes think, oh, I can just use an L_L_M_ to learn about X_Y_Z_, which is true, you can. But there might be a person who is an expert, who might have used an L_L_M_ to write specific code, and there is this human work that went into it: making it nice, throwing out the not-so-nice parts, kind of pre-digesting it for you, and that saves you time. That's the value-add, where you have someone who knows things, or even knows how to use the L_L_M_s correctly. This is still labour that you get for free when you, for example, read an article, let's say a Substack article. I could maybe ask an L_L_M_ to give me opinions on it, but I wouldn't even know what to ask. I think there is still value in reading that article compared to me going to the L_L_M_, because you are the expert: you select what knowledge is actually spot-on and should be included, and you give me this executive summary. And this is kind of a huge value-add, because now I don't have to waste three, five hours going through this myself, maybe getting some incorrect information, and so on. So I think that's also where the future still is for writers: even with L_L_M_s around, an expert can save you time.
It's kinda fascinating to actually watch, and I'm sure you guys do this, but for me to look at the difference between the summary
and the original content. Even if it's a page-long summary of page-long content, it's interesting to see how the L_L_M_ summary takes the edge off, like what is the signal it removes from the thing.
The voice is what I talk about a lot.
Voice, I would love to hear what you mean by voice, that's really powerful. But sometimes there are literally insights, like in removing an insight you're actually fundamentally changing the meaning of the thing. So I'm continuously disappointed how bad L_L_M_s are at really getting to the core insights, which is what a great summary does. Even when I have these extensive, extremely elaborate prompts where I'm really trying to dig for them,
it's still not quite there. Which, I mean, that's a whole deep philosophical question about what is human knowledge and wisdom and what does it mean to be insightful and so on. But when you talk about the voice, what do you mean?
So when I write, a lot of what I'm trying to do is take what you think as a researcher, which is very raw: a researcher is trying to encapsulate an idea at the frontier of their understanding, trying to put what is a feeling into words. And I try to do this in the writing, which makes it come across as raw, but also high-information, in a way where some people will get it and some won't, and that's kind of the nature of research. And I think this is something that language models don't do well,
particularly because they're all trained with this reinforcement learning from human feedback, which is designed to take feedback from a lot of people and, in a way, average how the model behaves from it. And it's going to be hard for a model to be very incisive when there's that sort of filter in it. I think this is kind of a wonderful fundamental problem for researchers in R_L_H_F_: it provides so much utility in making the models better, but the problem formulation has this
averaging in it that you can't get past. So that's what I think of as these language models not having this prior, this deep expression that they're trying to get at. I don't think it's impossible to do. There are stories of models that really shocked people. Like, I would love to have tried Bing Sydney. Does that have more voice? 'Cause it would so often go off the rails on people, in what is, historically, obviously a scary way. Telling a reporter to leave his wife is a crazy model to potentially put into general
availability. But that's kind of the trade-off: is this R_L_H_F_ process in some ways adding limitations?
That's a terrifying place to be for one of these frontier labs and companies, because millions of people are using them.
There was a lot of backlash last year with G_P_T_ four O_ getting removed. I personally never used the model, but I've talked to people at OpenAI, and they're to the point where they get emails from users who might be detecting subtle differences in the deployments in the middle of the night, and they email them, like, my friend is different. They find these employees' emails and send them things because they are so attached to this thing. But it's a set of model weights and a configuration that is deployed to the users.
We see this with TikTok. You open it, and, I don't use TikTok, but supposedly in like five minutes the algorithm gets you. It's locked in. And those aren't language models doing recommendations. But I think there are ways that you could do this with a language model: within like five minutes of chatting with it, the model just gets you. And that is something that people aren't really ready for. Like, don't give that to kids, at least until we know what's happening.
Mm-hmm.
do is they will say, well, the suicide was committed because of the L_L_M_. And that's going to lead to the companies, because of legal issues and so on, more and more taking the edge off of the L_L_M_. So it's going to be as generic as possible. It's so difficult to operate in this space, because of course you don't want an L_L_M_ to cause harm to humans at that level. But this is also the nature of the human experience: to have a rich conversation, a fulfilling
conversation, one that challenges you and from which you grow, you need that edge. And that's something that's extremely difficult for A_I_ researchers on the R_L_H_F_ front to actually solve, 'cause you're actually dealing with the human condition.
A lot of researchers at these companies are so well motivated, and places like Anthropic and OpenAI culturally so want to do good for the world through this. And it's such a hard thing; I'm like, I don't wanna work on this, because on the one hand a lot of people see A_I_ as a health ally, as somebody they can talk to about their health confidentially, but then it bleeds all the way into talking about mental health and things, where it's possible
that this will be the thing where somebody goes over the edge, but other people might be saved. There are things that, as a researcher training models, I don't wanna do. I don't wanna train image generation models and release them openly, 'cause I don't wanna enable somebody to have a tool on their laptop that can harm other people; I don't have the infrastructure at my company to do that safely. But there are a lot of areas like this where it just needs people that will approach it with complexity and conviction, because it's just
a hard problem.
But also we as a society, as users of these technologies, need to make sure that we're having the complicated conversation about it, versus just fear-mongering that big tech is causing harm to humans or stealing your data, all that kind of stuff. It is more complicated than that, and you're right: there's a very large number of people inside these companies, many of whom you know, many of whom I know, that deeply care about helping people. They are considering the full human experience of people from across the world, not just Silicon Valley, people across the United States, people
across the world, what that means, what their needs are. It's really difficult to design this one system that is able to help all these different kinds of people across different age groups, cultures, mental states, mental conditions, all that kind of stuff.
I wish that the timing of A_I_ was different with respect to the relationship of big tech to the average person. Big tech's reputation was already so low, and with how expensive A_I_ is, it's inevitably gonna be a big tech thing, because it takes so many resources. People say that the U_S_ is quote unquote betting the economy on A_I_ with this build-out. To have these be intertwined at the same time just makes for such a hard communication environment. It would be good for me to go talk to more people in the world that hate big tech and see A_I_ as a
continuation of this.
And one of the things you actually recommend, one of the antidotes that you talk about, is to find agency in this whole system, as opposed to sitting back in a powerless way and consuming the A_I_ slop as it rapidly takes over the internet. Find agency by using it to build stuff, build apps. One, that actually helps you build the intuition, but two, it's empowering, because you can understand how it
works, what the weaknesses are, and it gives your voice power to say, this is fucked up, this is a bad use of the technology, and this is a good use of the technology. You're more plugged into the system, so you can understand it better and you can steer it better.
It is a good point you brought up, agency. Instead of ignoring it and saying, okay, I'm not gonna use it, I think it's probably long-term healthier to say, okay, it's out there, I can't put it back, you know, like the internet and computers back when they came out. How do I make best use of it, and how does it help me to up-level myself? The one thing I worry about here, though, is if you fully use it for something you love to do, the thing you love to do is no longer there, and I feel like that could potentially lead to burn-out. For example, if I use
L_L_M_s to do all my coding for me, now there is no coding; I'm just managing something that is coding for me. Let's say two years later, if I just do that eight hours a day, have something code for me, do I still feel fulfilled? Is this hurting me in terms of being excited about my job, excited about what I'm doing? Am I still proud to build something?
So on that topic of enjoyment, it's quite interesting, we should just throw this in there, that there is this recent survey of about seven hundred and ninety-one professional developers, professional meaning ten-plus years of experience.
That's a long time.
Yeah, in this day and age. So the results here are surprising on many fronts. They break it down by junior and senior developers, but it shows that both junior and senior developers use A_I_-generated code in code they ship. So this is not just for fun, sort of intermediate learning things. This is code they
ship. And it's twenty-five percent at minimum: most of them use around fifty percent or more. And what's interesting is, for the category of over fifty percent of the code you ship being A_I_-generated, senior developers are much more likely to do so. But you don't want A_I_ to take away the thing you love. These particular results I'm about to mention speak to my experience: together, about eighty percent of people find it either somewhat more enjoyable or significantly more enjoyable
to use A_I_ as part of the work?
I think it depends on the task, from my personal usage, for example. I have a website where I sometimes tweak things. I personally don't enjoy this. So in that sense, if the A_I_ can help me implement something on my website, I'm all here for it. It's great. But at the same time, when I solve a complex problem, well, if there's a bug and I hunt this bug and I find the bug, it's the best feeling in the world. You get so much joy, you feel
great. But now if you don't even think about the bug, you just go directly to the L_L_M_, well, you never have this kind of feeling, right? But then there could be the middle ground where you try yourself, you can't find it, you use the L_L_M_, and then you don't get frustrated because it helps you, and you move on to something that you enjoy. And so looking at these statistics, I think what is not factored in is that it's averaging over all the different scenarios, so we don't know if it's for the core task or if it's
for something mundane that people would not have enjoyed otherwise. So in a sense, A_I_ is really great for doing mundane things that take a lot of work. For example, my wife the other day, she has a podcast for book discussions, a book club, and she was transferring the show notes from Spotify to YouTube, and the links somehow broke. In some episodes, because they discuss many books, she had like a hundred links or something, and it would have been really painful to go in there and fix each link
manually. And so I suggested, hey, let's try Chat G_P_T_. We copied the text into Chat G_P_T_ and it fixed them. Instead of two hours going from link to link fixing that, it made that type of work much more seamless; there was no frustration. I think everyone has a use case where A_I_ is useful for something like that, something that would be really boring, really mundane.
For me personally, since we're talking about coding, and you mentioned debugging, a lot of the source of enjoyment for me, more on the Cursor side than the Claude Code side, is that I have a friend, a, what's that called, a pair programmer. It's less lonely. You made debugging sound like this great joy. No, I would say debugging is like a drink of water after you've been going through a
desert for days. So you skip the whole desert part where you're suffering. Sometimes it's nice to have a friend who can't really find the bug either but can give you some intuition about the code, and you're going through the desert together with that friend and then together find that drink of water. So at least for me, it maybe speaks to the loneliness of the programming experience. That is a source of joy.
Mm-hmm.
Mm-hmm.
there, if you can solve it, then it's great. But there's also a sweet Goldilocks zone, because if it's too hard, then it's, you know, wasting your time. But I think that is another challenge, though: how will people learn? I mean, in the chart we looked at, we saw that more senior developers are shipping more A_I_-generated code than the junior ones, and I think it's very interesting, because intuitively you would think it's the junior developers, because they don't know, let's say, how to do the thing yet, because they are more junior, and so they use A_I_
to do that thing. It could either mean the A_I_ is not good enough yet to solve that task, or it could mean experts are more effective at using it: they know better where and how to use it, they review the code, and they trust the code more then. And so I think one issue in society in the future will be, how do you become an expert if you never try to do the thing yourself? One way, it's always like how I learn, is by trying things myself. Like math textbooks: if you
look at the solutions, yeah, you learned something, but I think you actually learn better if you try first, and then you appreciate the solution differently, because you know how to put it into your mental framework. And if L_L_M_s are here all the time, would you actually go through the lengths of struggling? Would you be willing to struggle? Because struggle is not nice, right? And if you use the L_L_M_ to do everything, at some point you will never really take the next step. And then you will maybe not get that unlock that you
get as an expert using an L_L_M_. So I think there's a Goldilocks sweet spot, and maybe the trick here is you make dedicated offline time where you study two hours a day, and the rest of the day you use L_L_M_s. But I think it's important for people to still invest in themselves, in my opinion, to not just, you know, L_L_M_ everything.
Yeah, and we together as a civilization, and each of us individually, have to find that Goldilocks zone, in the programming context as developers. Now, we had this fascinating conversation that started with pre-training and mid-training. Let's get to post-training. A lot of fun stuff in post-training. So what are some of the interesting ideas in post-training?
Mm-hmm.
A lot of this is kind of an iterative generate-and-grade loop, and that lets the models learn interesting behaviours on both the tool use and software side. This could be searching, running commands on their own and seeing outputs. And that training also enables this inference-time scaling very nicely. It just turned out that this paradigm was very nicely linked, where this kind of R_L_ training enables inference-time scaling, but inference-time scaling could have been found in different ways. So it was kind of this perfect storm: the models changed a lot, and the
way that they're trained is a major factor in doing so, and this has changed how people approach post-training dramatically.
Can you describe R_L_V_R_, popularized by DeepSeek R_ one? Can you describe how it works?
Yeah, fun fact, I was on the team that came up with the term R_L_V_R_, which is from our Tülu 3 work before DeepSeek. We don't take a lot of credit for being the people to popularize scaling R_L_, but as an aside, it is fun that what academics get is the ability to name things and influence the discourse, because the closed labs can only say so much. One of the things you can do as an academic is, you might not have the compute to train the model, but you can frame things,
and a community can come together around this R_L_V_R_ term, which is very fun. And then DeepSeek is the people that did the training breakthrough, which is they scaled the reinforcement learning: you'd have the model generate answers, then grade the completion on whether it was right, and that accuracy is your reward for reinforcement learning. Reinforcement learning is classically an agent that acts in an environment; the environment gives it a state and a reward back, and you try to maximise that
reward. In the case of language models, the reward is normally accuracy on a set of verifiable tasks, whether it's math problems or coding tasks. It starts to get blurry with things like factual domains, which are also in some ways verifiable, or constraints on your instruction, like respond only with words that start with A_. All of these things are verifiable in some way, and the core idea is you find a lot more of these problems that are
verifiable and you let the model try them many times while taking these R_L_ steps, these R_L_ gradient updates. The infrastructure evolved from reinforcement learning from human feedback, where in that era the score they were trying to optimise was a learned reward model of aggregate human preferences. So you change the problem domains, and that let the optimisation go on to much bigger scales, which kick-started a major change in what the models can do and how people use them.
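The grading step described here can be sketched as a couple of toy verifier functions. The `Answer:` extraction format and the words-start-with-A constraint are illustrative assumptions for this sketch, not any lab's actual setup:

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0.

    Assumes the model is prompted to end with 'Answer: <value>'; the exact
    answer format is a per-setup choice, not a standard.
    """
    match = re.search(r"Answer:\s*(.+)\s*$", completion.strip())
    if match is None:
        return 0.0  # unparseable completions get no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def constraint_reward(completion: str) -> float:
    """A non-math verifiable reward: every word must start with 'a'."""
    words = completion.split()
    if not words:
        return 0.0
    return 1.0 if all(w.lower().lstrip('"\'(').startswith("a") for w in words) else 0.0
```

The reward is all-or-nothing here; real setups often mix several such checks (correctness, formatting, language consistency) into one scalar.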
What kind of domains is R_L_V_R_ amenable to?
Math and code are the famous ones, and then there's a lot of work on what are called rubrics, which is related to a phrase people might have heard, L_L_M_-as-a-judge. For each problem in my training data set, I'll have another language model and ask it, what would a good answer to this problem look like? And then you could try the problem a bunch of times over and over again and assign a score based on this rubric. So that's not necessarily verifiable like a math or code domain, but this rubrics approach,
for other scientific problems where things might be a little bit more vague, is where a lot of the attention is. They're trying to push this set of methods into these more open-ended domains so the models can learn a lot more.
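The rubric setup described above can be sketched as a small scoring loop. The prompt template and the `judge` callable are placeholders for what would really be an L_L_M_ API call, one yes/no query per criterion, so treat this as the shape of the method rather than a real grader:

```python
from typing import Callable

def rubric_score(problem: str, completion: str,
                 rubric: list[str],
                 judge: Callable[[str], bool]) -> float:
    """Fraction of rubric criteria the judge says the completion satisfies.

    `judge` stands in for an L_L_M_-as-a-judge call; here it is just a
    callable returning True/False so the sketch stays self-contained.
    """
    if not rubric:
        return 0.0
    prompts = [f"Problem: {problem}\nAnswer: {completion}\nCriterion: {c}"
               for c in rubric]
    return sum(judge(p) for p in prompts) / len(rubric)
```

The resulting fraction can then be used directly as the R_L_ reward, in place of the binary correct/incorrect signal of math and code domains.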
I think that's called reinforcement learning from A_I_ feedback, right?
That's the older term for it, which was coined in Anthropic's Constitutional A_I_ paper. A lot of these things come in cycles.
Also, just one step back on R_L_V_R_. I think the interesting, beautiful thing here is that you ask the L_L_M_, let's say, a math question, and you know the correct answer. And you let the L_L_M_, like you said, figure it out. But how it does that, you don't really constrain much. There are some constraints you can add, like use the same language, don't switch between Spanish and English. But let's say you're pretty much hands off: you only give the question and the answer, and then the L_L_M_ has the task to arrive at the
right answer. The beautiful thing here is that what happens in practice is that the L_L_M_ will do a step-by-step description, like how a student or a mathematician would derive the solution. It will use those steps, and that actually helps the model to improve its own accuracy. And then, like you said, the inference scaling. Inference scaling loosely means spending more compute when using the L_L_M_ during inference. And here the inference scaling is that the model
would use more tokens. And also, I think in the R_ one paper they showed the longer they train the model, the longer the responses are. They grow over time; they use more tokens, so it becomes more expensive, even for simple tasks. But these explanations help the model with the accuracy. There are also a lot of interesting papers showing that what the model explains does not necessarily have to be correct, or maybe it's even unrelated to the answer, but for some reason it still helps the model; it's the fact that it is explaining at all. And again, I
don't wanna anthropomorphize these L_L_M_s, but it's kinda like how we humans operate, right? If there's a complex math problem, let's say in a math class, you usually have note paper and you do it step by step, you cross out things. And the model also self-corrects, and that was, I think, the aha moment in the R_ one paper. They called it the aha moment because the model itself recognised it made a mistake and then said, ah, I did something wrong, so let me retry. I think it's just so cool that this falls out of just giving it the correct answer and having it figure
out how to do it, that it kind of does in a sense what a human would do, although L_L_M_s don't think like humans; it's kind of an interesting coincidence. And a nice side effect is that it's great for us humans to see these steps: it builds trust, but we also learn, and we can double-check things.
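For concreteness: DeepSeek R_ one's training used a group-relative policy-gradient method (G_R_P_O_), which turns these binary correct/incorrect rewards into a learning signal by normalising within a group of attempts at the same question. This is a minimal sketch of just that advantage step, with everything else (sampling, the policy update itself) omitted:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalise per-attempt rewards within one question's group of samples.

    With binary verifiable rewards, correct attempts get a positive advantage
    and incorrect ones a negative advantage. If every attempt agrees (all 0s
    or all 1s), there is no contrast and no learning signal for this question.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # no contrast within the group
    return [(r - mean) / std for r in rewards]
```

Questions the model always gets right (or always gets wrong) contribute nothing, which is one reason difficulty filtering of the training problems matters so much in these pipelines.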
There's a lot in here. There's been a lot of debate this year on whether the aha moments in these language models are kind of fake, because in pre-training you have essentially seen the whole internet, so you have definitely seen people explaining their work, even verbally, like a transcript of a math lecture: you try this, oh, I messed this up. What reinforcement learning, this R_L_V_R_, is very good at doing is amplifying these behaviours, 'cause they're very useful in enabling the model to think longer and to check its work. And I agree that it is very
Mm-hmm.
striking that with this training, the model learns to amplify this in a way that is just so useful for making the final answers better.
I can give you a hands-on example. I was training the Qwen three base model with R_L_V_R_ on MATH five hundred. The base model had an accuracy of about fifteen percent. In just fifty steps, like a few minutes, with R_L_V_R_ the model went from fifteen percent to fifty percent accuracy. You can't tell me it's learning anything fundamentally about math in so few steps.
There have been two papers this year, one of which I was on, that talk about data contamination in Qwen, and specifically that they train, in this special mid-training phase that we just spent a minute on, and it's weird, on problems that are almost identical to MATH.
Mm-hmm.
Mm-hmm.
There have been multiple papers talking about contamination, so how much can you believe the results? And I think this is what caused the reputation of R_L_V_R_ being about formatting: because you can get these gains so quickly, the capability must already be in the model. But there's a lot of complexity here, and it's not really controlled experimentation, so you don't really know.
If it weren't true, I would say distillation wouldn't work, right? I mean, distillation can work to some extent. But the thing is, and I think this is the biggest problem, it's hard to research this contamination, because we don't know what's in the data; unless you have a brand-new data set, it's really impossible. And the same, you mentioned the MATH data set, where you have a question, an answer, and a given explanation. But also, even for something simpler like M_M_L_U_, which is a multiple-choice benchmark, if you just change the format slightly, like,
I don't know, you use a dot instead of a parenthesis or something like that, the model accuracy will vastly differ.
I think that could be a model issue rather than a general issue.
It's not even malicious by the developers of the L_L_M_, like, hey, we wanna cheat at that benchmark. It's just that it has seen something at some point. And I think the only fair way to evaluate an L_L_M_ is to have a new benchmark created after the cut-off date, when the L_L_M_ was already deployed.
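The dot-versus-parenthesis point above can be made concrete with a tiny formatter that renders the same multiple-choice question two ways. These templates are illustrative, not the official M_M_L_U_ prompt; the point is only that evaluations differing in this surface marker can yield noticeably different accuracies for the same model:

```python
def format_mcq(question: str, choices: list[str], style: str = "dot") -> str:
    """Render one multiple-choice question in two surface formats.

    style='dot'   -> 'A. <choice>'
    style='paren' -> 'A) <choice>'
    """
    letters = "ABCD"
    sep = ". " if style == "dot" else ") "
    lines = [question] + [f"{letters[i]}{sep}{c}" for i, c in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"
```

Running a benchmark with both styles and comparing accuracies is a cheap sanity check for this kind of format sensitivity.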
Can we lay out what would be the recipe of all those things that go into post-training? You mentioned R_L_V_R_ as a really exciting, effective thing; maybe we should elaborate that R_L_H_F_ still has a really important role to play. What other ideas are there in post-training?
Mm-hmm.
of all forming together. But to summarise: mid-training is, give the model the skills it needs to then learn. R_L_ with verifiable rewards is, let the model try a lot of times, so put a lot of compute into trial-and-error learning across hard problems. And then R_L_H_F_ would be, finish the model, make it easy to use, and kind of round the model out.
Can you comment on the amount of compute required for R_L_V_R_?
It's only gone up and up. I think Grok four was famous for saying they used a similar amount of compute for pre-training and post-training. Back to the scaling discussion: they involve very different hardware for scaling. Pre-training is very compute-bound, which is this FLOPS discussion, just how many matrix multiplications can you get through in a given time. And because in R_L_ you're generating these answers, you're trying the model in real-world environments, it ends up being much more memory-bound, because you're generating long sequences and the attention mechanisms have this
behaviour where you get a quadratic increase in memory as you get to longer sequences. So the compute becomes very different. In pre-training we would talk about a model, if we go back to the Biden administration executive order, it's like ten to the twenty-fifth FLOPS to train a model. If you're using FLOPS in post-training, it's a lot weirder, because the reality is just, how many hours are you allocating how many G_P_U_s for? And I think in terms of time, the R_L_ compute is getting much closer, because you just can't pack it
into one system. Pre-training is so computationally dense, where all the G_P_U_s are talking to each other and it's extremely efficient, where R_L_ has all these moving parts, and it can just take a long time to generate a sequence of a hundred thousand tokens. If you think about G_P_T_ five point two pro taking an hour, it's like, what if your training run has a sample that takes an hour, and you have to make it so that's handled efficiently? So in G_P_U_ hours, or just wall-clock hours, the R_L_ runs are probably approaching the number of days of pre-training, but they probably aren't using as many
G_P_U_s at the same time. There are rules of thumb in labs where you don't want your pre-training runs to last more than about a month, because they fail catastrophically. And if you were planning a huge cluster to be held for two months and then it fails on day fifty, the opportunity cost is just so big. So people don't wanna put all their eggs in one basket. G_P_T_ four was the ultimate YOLO run, and nobody ever wanted to do it before; it took like three months to train, and everybody was shocked that it worked. I think people are a little bit more
incremental now.
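The quadratic-memory point above can be sketched with a back-of-the-envelope function. The head count and two-byte precision are illustrative assumptions, and real inference stacks avoid materialising this matrix (for example, FlashAttention recomputes it in tiles), but the scaling intuition is the point:

```python
def attention_score_bytes(seq_len: int, n_heads: int = 32,
                          bytes_per_elem: int = 2) -> int:
    """Memory for one layer's full attention-score matrix, if materialised.

    The matrix is (n_heads x seq_len x seq_len), so doubling the sequence
    length quadruples this term: the quadratic behaviour described above.
    """
    return n_heads * seq_len * seq_len * bytes_per_elem
```

At a hundred thousand tokens this single term reaches hundreds of gigabytes per layer, which is why long R_L_ rollouts stress memory rather than raw FLOPS.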
So with R_L_V_R_, it's, let's say, more unlimited how much you can train and still get benefit, whereas with R_L_H_F_, because it's preference tuning, you reach a certain point where it doesn't really make sense to spend more R_L_ budget on it. Just to step back on preference tuning: multiple people can give multiple explanations for the same thing, and they can both be correct, but at some point you've learned a certain style and it doesn't make sense to, you know, keep iterating on it. My favourite example is if relatives ask
me what laptop they should buy. I give them an explanation, ask them, what is your use case? They, for example, prioritize battery life and storage. Other people, like us, would prioritize RAM and compute. Both answers are correct, but different people require different answers. And with preference tuning, you're trying to average somehow: you're asking the data labelers to give you, well, not the right answer but the preferred answer, and then you train on that. At some point you've learned that average preferred answer,
and there's no reason, I think, to keep training longer on it, because it's just a style. With R_L_V_R_, you literally let the model solve more and more difficult problems, and so I think it makes more sense to allocate more budget long-term to R_L_V_R_. Also, right now we are in R_L_V_R_ one point O_ land, where it's still that simple thing where we have a question and an answer, but we don't do anything with the part
in between. There were multiple research papers, also by Google, for example, on process reward models that also give scores for the explanation: how correct is the explanation? And I think that will be the next thing, let's say R_L_V_R_ two point O_, for this year: focusing in between question and answer, how to leverage that information, the explanation, to improve the explanation and help it get better accuracy. So that's one angle. And there was a DeepSeek Math
two paper where they also had interesting inference scaling: they developed models that grade themselves with a separate model. I think that will be one aspect, and the other, like Nathan mentioned, will be that for R_L_V_R_ we are branching into other domains.
the place where people are excited are value functions, which is ver pretty similar. So process reward models are kind of like process reward models assign how good something is to each kind of intermediate step in a reasoning process where value functions apply value to every token the language model generates. Both of these have been largely unproven in the language modelling in this reasoning model era. People are more optimistic about value functions forever for whatever reason now. I think process
reward models were tried a lot more in this pre-O_ one, pre-reasoning-model era, and a lot of people had a lot of headaches with them. So I think a lot of it is human nature: value models have a very deep history in reinforcement learning. They're one of the things that were core to deep reinforcement learning existing, like training value models. So right now in the literature people are excited about trying value models, but there's very little proof in it. And there are negative examples in trying to scale up process reward models. These things don't always hold in the
Mm-hmm.
Mm-hmm.
plot like this. But there's no scaling law for R_L_H_F_ where, if you increase the compute, you get some performance. In fact, the seminal scaling paper for R_L_H_F_ is scaling laws for reward model over-optimization. So that's a big line to draw with R_L_V_R_: the methods we have now and in the future will follow the scaling paradigm, which is like the best runs you can let run for an extra ten X_ and you get a few X_ performance, but you can't do this with R_L_H_F_. And that is just gonna be field defining and
people approach them, where I'm a shill for people academically to do R_L_H_F_. And that's a good way to describe it: to do the best R_L_H_F_ you might not need the extra ten or a hundred X_ compute, but to do the best R_L_V_R_ you do. So there's what I'd say is a seminal paper from what was a Meta internship, it's called The Art of Scaling Reinforcement Learning with language models. What they describe as a framework is Scale R_L_, and their incremental experiment was like ten
and could be two hundred hours, which is like thousands or tens of thousands of dollars per experiment, and they do a lot of them, and this cost is just not accessible to the average academic, which is a hard equilibrium, where it's trying to figure out how each community can learn from the other.
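As a rough sanity check on those numbers, the arithmetic can be sketched as follows; the hourly rate is an assumed illustrative cloud price, not a quote:

```python
# Back-of-envelope cost of one R_L_ experiment: G_P_U_ count times
# wall-clock hours times an assumed dollars-per-G_P_U_-hour rate.

def experiment_cost(gpu_count: int, hours: float, dollars_per_gpu_hour: float) -> float:
    return gpu_count * hours * dollars_per_gpu_hour

# e.g. an 8-G_P_U_ node for two hundred hours at an assumed $2/G_P_U_-hour:
cost = experiment_cost(8, 200, 2.0)  # 3200.0 dollars
```

Scale to the multi-node setups used in these papers and a single sweep easily lands in the tens of thousands of dollars.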
I was wondering if we could take a bit of a tangent at this point and talk about education and learning. If you're somebody listening to this who's a smart person interested in programming, interested in A_I_, I presume building something from scratch is a good beginning. So can you just take me through what you would recommend people do?
I would personally start, like you said, implementing a simple model from scratch that you can run on your computer. The goal, if you build a model from scratch, is not to have something you use every day for your personal projects. It's not gonna be your personal assistant replacing an existing open-weight model or ChatGPT. It's to see what exactly goes into the L_L_M_, what exactly comes out of the L_L_M_, how the pre-training works, in that sense, on your own computer preferably. And then if you learn about the pre-training, the supervised
tuning, the attention mechanism, you get a solid understanding of how things work. But at some point you will reach a limit, because small models can only do so much. And the problem with learning about L_L_M_s at scale is, I would say, it's exponentially more complex to make a larger model, because it's not just that the model becomes larger. You now have to think about sharding your parameters across multiple G_P_Us. Even for the K_V_ cache, there are multiple ways you can implement it. One is just to understand how it works: just grow the cache. That's
a cache you grow step by step by, let's say, concatenating lists, growing it. But then it wouldn't be optimal on G_P_U_s; you wouldn't do that. You would pre-allocate a tensor and then fill it in. But that adds another twenty or thirty lines of code, and for each thing you add so much code. And I think the trick with a book is basically to understand how the L_L_M_ works. It's not gonna be your production-level L_L_M_, but once you have that, you can understand the production-level L_L_M_.
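The two K_V_-cache strategies just described can be sketched in toy Python, with plain lists standing in for tensors; a real implementation would pre-allocate G_P_U_ tensors, and the class names here are made up for illustration:

```python
# Variant 1: grow the cache step by step (easiest to understand, but slow
# on G_P_U_s because of repeated reallocation). Variant 2: pre-allocate up
# to a maximum length and fill in position by position, which is what real
# implementations do.

class GrowingKVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)      # concatenate-style growth each step
        self.values.append(v)

class PreallocatedKVCache:
    def __init__(self, max_len):
        self.keys = [None] * max_len    # allocated once, up front
        self.values = [None] * max_len
        self.pos = 0

    def append(self, k, v):
        self.keys[self.pos] = k         # fill in place, no reallocation
        self.values[self.pos] = v
        self.pos += 1

cache = PreallocatedKVCache(max_len=8)
cache.append([0.1, 0.2], [0.3, 0.4])    # one decoding step
```

Both store the same per-step keys and values; the pre-allocated version is the extra twenty or thirty lines that buys you G_P_U_ efficiency.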
So you're trying to always build an L_L_M_ that's gonna fit on one G_P_U_?
Yes. Most of them, I have some bonus materials on some M_O_E_ models, I think one or two of them may require multiple G_P_Us, but the goal is to have it on one G_P_U_. And the beautiful thing is also you can self-verify. It's almost like R_L_V_R_ when you code these from scratch. You can take an existing model from the Hugging Face Transformers library. So the Hugging Face Transformers library is great, but if you wanna learn about L_L_M_s, I think that's not the best place to start, because the code is so complex, because it has so
it has to fit so many use cases. Also some people use it in production. It has to be really sophisticated and it's really intertwined and really hard. It's not linear to read.
It started as a fine-tuning library, and then it grew to be the standard representation of every model architecture and the way it is loaded. So Hugging Face is like the default place to get a model, and Transformers is the software that enables it, so people can easily load a model and do something basic with it.
And all frontier labs that have open-weight models have a Hugging Face Transformers version of it, from DeepSeek to G_P_T_ O_S_S_. That's the canonical weight that you can load there. But again, even Transformers, the library, is not used in production. People then use S_G_Lang or V_L_L_M_, and that adds another layer of complexity.
We should say that the transformers library has like four hundred models.
So it's one library that tries to implement a lot of L_L_M_s, and so you have a huge code base, basically. It's huge, it's, I dunno, maybe hundreds of thousands of lines of code, and understanding the part that you wanna understand is finding the needle in the haystack. What's beautiful about it is you have a working implementation, and so you can work backwards from it. What I recommend doing, or what I also do, is if I wanna understand, for example, how Olmo three is implemented, I would look
at the weights in the model hub, the config file, and then you can see, oh, they use so many layers, they use, let's say, grouped-query attention or multi-head attention in that case. Then you see all the components in a human-readable, I dunno, hundred-line config file, and then you start, let's say, with your G_P_T_ two model and add these things, you know. And the cool thing here is you can then load the pre-trained weights and see if they work in your model. And you wanna match the same output that you get with the Transformers model, and then you can use that basically as a verifiable
way to make your architecture correct. And sometimes it takes me a day. With Olmo three, the challenge was RoPE for the position embeddings. They had a YaRN extension, and there was some custom scaling there, and I couldn't quite match these things. And in this struggle you kind of understand things. But the cool thing is, at the end you know you have it correct, because you can unit test it, you can check against the reference implementation. And I think that's maybe one of the best ways to really learn:
basically reverse engineer something. Yep.
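That verification loop, checking your from-scratch code against a reference, looks roughly like this; the two "models" below are trivial stand-ins, since in practice you would compare logits from your implementation against the Hugging Face Transformers version:

```python
# Unit-test pattern for a from-scratch reimplementation: feed the same
# input to both implementations and require the outputs to match within
# a tolerance. The "models" here are stand-in functions for illustration.

def reference_model(x):
    return [2.0 * v + 1.0 for v in x]

def my_from_scratch_model(x):
    out = []
    for v in x:
        out.append(v * 2.0 + 1.0)  # same computation, written independently
    return out

def outputs_match(a, b, tol=1e-6):
    return len(a) == len(b) and all(abs(p - q) <= tol for p, q in zip(a, b))

x = [0.5, -1.0, 3.0]
assert outputs_match(my_from_scratch_model(x), reference_model(x))
```

When something like a RoPE scaling detail is wrong, this check fails, and the mismatch tells you exactly where to keep digging.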
I think that is something that everybody that's interested in getting into A_I_ today should do. And I think that's why I liked your book. I came to language models from this R_L_ and robotics field. I'd never taken the time to just learn all the fundamentals, and this transformer architecture I described as being so fundamental, like deep learning was a thing that I had to learn in the past, and people need to do this. And I think where a lot of people get overwhelmed is how do I apply this to have
impact, or find a career path, because A_I_ and language models make this fundamental stuff so accessible, and people with motivation can learn it. And then it's like, how do I get the cycles in to contribute to research? And I'm actually fairly optimistic on this, because the field moves so fast that a lot of times the best people don't fully solve a problem, because there's a bigger problem to solve that's very low-hanging fruit, so they move on. And I think a lot of what I was trying to do in the
R_L_H_F_ book is take post-training techniques and just describe how people think about them influencing the model and what people are doing, and it's remarkable how many things people just stopped studying, or never did. So I think people trying to get narrow after doing the fundamentals is good, and then reading the relevant papers and being engaged in the ecosystem. The proximity that random people online have to the leading researchers
is remarkable. Anonymous accounts on X_ in M_L_ are very popular for whatever reason, and no one knows who all these people are. It could just be random people that study the stuff deeply. Especially with the A_I_ tools, just be like, I don't understand this, keep digging into it. I think it's a very useful thing. But there are a lot of research areas where there are maybe three papers that you need to read. And then one of the authors will probably email you back. But you have to put a lot of effort into these emails to understand the field. Like I think it would be
for a newcomer easily weeks of work to feel like they can truly grasp what is a very narrow area. But I think going narrow after you have the fundamentals would be very useful to people. Like, I became very interested in character training, which is how you make the model funny or sarcastic or serious, and what do you do to the data to do this. And a student at Oxford reached out to me, like, hey, I'm interested in this, and I advised him, and now that paper exists. And it's like, I don't
know, there's like two or three people in the world that were very interested in this. He's a P_H_D_ student, which gives you an advantage, but for me that was a topic where I was waiting for someone to be like, hey, I have time to spend cycles on this. And I'm sure there are a lot more very narrow things where you're just like, oh, it doesn't make sense that there was no answer to this. And there's just so much information coming that people are like, I can't grab onto any of these, but if you actually stick with an area, I think there are a lot of interesting things to learn.
Yeah, I think you can't try to do it all, because it would be very overwhelming and you would burn out if you tried to keep up with everything. For me, for example, I haven't kept up with computer vision in a long time, just focused on L_L_M_s. But coming back to your book, for example, I think this is also a really great book and really good bang for the buck, because if you wanna learn about R_L_H_F_, I wouldn't go out there and read R_L_H_F_ papers, because you would be spending two years. Yeah.
And we'll see what comes out to be true.
What are some of the, just to go through some of the table of contents, some of the ideas we might have missed in the bigger picture of post-training. So first of all you do the problem setup, training overview, what are preferences, preference data, and the optimisation tools: reward modelling, regularisation, instruction tuning, rejection sampling, reinforcement learning, i.e. policy gradients, direct alignment algorithms, then constitutional A_I_ and A_I_ feedback, reasoning and inference-time scaling, tool use and function calling, synthetic data and distillation,
evaluation, and then an open questions section: over-optimization, style and information, and then product, U_X_, character, and post-training. So what are some ideas worth mentioning that connect both the educational component and the research component? You mentioned character training, which is pretty interesting.
Character training is interesting 'cause there's so little out there about it. We talk about how people engage with these models, and look, we feel good using them 'cause they're positive, but that can go too far, it could be too positive. Essentially it's: how do you change your data and/or decision making to make it exactly what you want? And open A_I_ has this thing called a model spec, which is essentially their internal guideline for what they want the model to do, and they publish this to developers, so essentially you can know what is a
shortcoming of open A_I_'s training, which is like they have the intentions and they haven't met them yet, versus what is something that they actually wanted to do and that you don't like. And that transparency is very nice, but all the methods for curating these documents and how easy it is to follow them is not very well known. I think the way the book is designed is that the reinforcement learning chapter is obviously what people want, because everybody hears about it with R_L_V_R_. And it's the same algorithm, it's the same math, but you can use it in very different domains. So I think the core problem of R_L_H_F_,
how messy preferences are, is essentially a rehash of a paper I wrote years ago. But this is essentially the chapter that'll tell you why R_L_H_F_ is never fully solvable, because the way that even the R_L_ is set up assumes that preferences can be quantified and that multiple preferences can be reduced to single values. And I think it relates, in the economics literature, to the von Neumann-Morgenstern utility theorem.
And that is the chapter with all of the philosophical, economic, and psychological context; it tells you what gets compressed into doing R_L_H_F_. So you have all of this, and then later in the book you use this R_L_ math to make the number go up. And that's why I think it would be very rewarding for people to do research on, because quantifying preferences is something where humans have designed the problem in order to make preferences studyable. But there are kind of fundamental debates on it. An
example is, in a language model response you have different things you care about, whether it's accuracy or style, and when you're collecting the data they all get compressed into "I like this more than another." That is happening, and there's a lot of research in other areas of the world that goes into how you should actually do this. I think social choice theory is the subfield of economics around how you should aggregate preferences. And I went to a workshop that published a white paper, like,
can you think about using social choice theory for R_L_H_F_? So I mostly would want people that get excited about the math to come and have things they can stumble into, and learn this kind of broader context. There's a fun thing: I keep a list of all the tech reports that I like of reasoning models. So in chapter fourteen, where there's a kind of short summary of R_L_V_R_, there's just a gigantic table where I list every single reasoning model that I like. I think in education, a lot of it at this point needs to be what I like,
because the language models are so good at the math. There's the famous paper, direct preference optimisation, which is a much simpler way of solving the problem than R_L_. The derivations in the appendix skip steps of math, and for this book I redid the derivations, and I'm like, what the heck is this log trick that they use to change the math? But doing it with language models, they're like, this is the log trick. And I don't know if I like this, that the math is so commoditized. I think some of the struggle in reading
this appendix and following the math, I think, is good for learning.
Both. Some of the providers are starting to work on models for education, which are designed to, actually I haven't used them, but I would guess they're designed to not give all the information at once and make people work for it. So I think you could train models to do this, and it would be a wonderful contribution, where all of this stuff in the book, you have to reevaluate every decision for it, which is such a great example. I think there's a chance we work on it at Ai2, which, I was like, oh, I think this would be so cool.
Mm-hmm.
Mm-hmm.
fully on board, but the problem here is I think it requires discipline. There are a lot of people who enjoy math, but there are also a lot of people who need to do it for their homework, and then it's like the shortcut. And yeah, we can develop an educational L_L_M_, but the other L_L_M_s are still there, and there's still a temptation to use the other L_L_M_s.
They understand the stuff they're passionate about, they're self-aware about it, and they understand it shouldn't be easy. I think we just have to develop good taste. We talk about research taste; similarly, learning taste: stuff that you should be struggling on and stuff you shouldn't be struggling on, which is tricky to know, 'cause sometimes you don't have good long-term vision about what would actually be useful to you in your career. But you have to develop that taste, yeah.
I was talking to maybe my fiancee or friends about this, and it's like there's this brief ten-year window where all of the homework and all the exams could be digital. Before that, everybody had to do all the exams in Bluebooks 'cause there was no other way, and now, after A_I_, everybody's gonna need to be in Bluebooks and oral exams 'cause everybody could cheat so easily. It's this brief generation that had a different education system, where everything could be digital but you still couldn't cheat, and now it's just gonna go back. It's just very funny.
You mentioned character training; just zooming out on a more general topic. For that topic, how much compute was required? And in general, to contribute as a researcher, are there places where not too much compute is required, where you can actually contribute as an individual researcher?
On the character training thing, that research was built on fine-tuning about seven-billion-parameter models with LoRa, which is essentially where you only fine-tune a small subset of the weights of the model. I don't know exactly how many G_P_U_ hours that would take.
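For readers who haven't seen it, the LoRa idea can be sketched in plain Python, with nested lists standing in for tensors; the shapes and numbers are purely illustrative:

```python
# LoRa sketch: the frozen weight matrix W is left untouched, and a
# low-rank update A @ B is trained and added on top, so only a small
# number of parameters (here 4*1 + 1*4 = 8 instead of 16) needs gradients.

def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ W + alpha * (x @ A @ B); W stays frozen, only A and B train."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [[base[i][j] + alpha * update[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]

W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # frozen 4x4
A = [[0.25], [0.25], [0.25], [0.25]]  # rank-1 adapter, 4x1
B = [[0.5, 0.0, 0.0, 0.0]]            # rank-1 adapter, 1x4
x = [[1.0, 2.0, 3.0, 4.0]]
y = lora_forward(x, W, A, B)
```

In a real setup, W comes from a pre-trained checkpoint and A and B are trained by gradient descent, which is why the G_P_U_ bill is a fraction of full fine-tuning.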
But it's doable.
not doable for every academic. The situation for some academics is so dire that the only work you can do is inference, where you have closed models or open models, you get completions from them, and you can look at them and understand the models. And that's very well suited to evaluation, where you want to be the best at creating representative problems that the models fail on or that show certain abilities. I think you can break through with this. I think the top-end goal for a researcher working on
evaluation, if you want to have career momentum, is that the frontier labs pick up your evaluation. You don't need to have every project do this. But if you go from a small university with no compute and you figure out something that Claude struggles with, and then the next Claude model has it in the blog post, there's your career rocket ship. I think that's hard, but if you wanna scope the maximum possible impact with minimum compute, it's something like that, which is: get very narrow, and it takes learning where the models are going. So you need to
build a tool that tests not where Claude four point five will fail. If I'm gonna start a research project, I need to think about where the models in eight months are gonna be struggling.
But what about developing totally novel ideas?
this is a trade-off. I think if you're doing a P_H_D_, you could also say it's too risky to work in language models, I'm going way longer term: what is the thing that's gonna define language model development in ten years? I end up being a person that's pretty practical. I went into my P_H_D_ like, oh, I got into Berkeley, worst case I get a master's and then I go work in tech. So I'm very practical about it: the life afforded to people who work at these A_I_ companies, the amount of money is wild.
Like, open A_I_'s average compensation is over a million dollars in stock a year per employee. For any normal person in the U_S_, getting into an A_I_ lab is transformative for your life. So I'm pretty practical: there's still a lot of upward mobility working in language models if you're focused, and the outcomes, look at these jobs. But from a research perspective, the transformative impact and the academic awards, like being the next Yann LeCun, come from not caring about language model development very much.
So I get to work with some awesome students, and they're like, should I go work in an A_I_ lab? And I'm like, you're getting a P_H_D_ at a top school, or you're gonna leave to go to a lab? If you go work at a top lab, I don't blame you. Don't go work at some random start-up that might go to zero. But if you're going to open A_I_, it could be worth leaving a P_H_D_ for.
Let's more rigorously think through this. Where would you recommend people go to make a research contribution? So the options are academia, so get a P_H_D_, spend five years publishing, compute resources are constrained. There's
there are research labs that are more focused on open-weight models, so working there. Or closed frontier research labs: open A_I_, Anthropic, X_A_I_, and so on.
Mm-hmm.
Mm-hmm.
Mm-hmm.
and trade-offs that in my opinion favour just taking the well-paying job with meaningful impact. So it's not only that you're getting paid to sit around at open A_I_; you're building the cutting edge of things that are changing millions of people's relationship to tech.
But there you're a cog in a machine.
I think, honestly, it hasn't changed that much. I have been in academia; I'm not in academia anymore. At the same time, I wouldn't wanna miss my time in academia. But what I wanted to say before I get to that part: I think it hasn't changed that much. I was using machine learning methods for applications in computational biology with collaborators, and a lot of people went from academia directly to Google, and I think it's the same thing back
then: professors were, you know, sad that their students went into industry, because they couldn't carry on their legacy in that sense. And I think it's the same thing. It hasn't changed that much, I think. The only thing that has changed is the scale. But, you know, cool stuff was always developed in industry that was closed; you couldn't talk about it. And I think the difference now is, well, your preference: do you like to talk about your work and publish, or are you more in a closed lab?
That's one difference, the compensation of course, but it's always been like that, I think. So it really depends on, you know, where you feel comfortable, and also nothing is forever. The only thing right now is there's a third option, which is starting a start-up. A lot of people are doing start-ups, a very risky move, but it can be a high-risk, high-reward type of situation, where joining an industry lab, I think, is pretty safe. You know, also upward mobility: honestly,
once you have been at an industry lab, it will be easier to find future jobs. But then again, you know, how much do you enjoy the team and working on proprietary things, versus how much do you like the publishing work? I mean, publishing is stressful; acceptance rates at conferences can be arbitrary, can be very frustrating, but also high reward. If you have a paper you're proud to publish, you feel good because your name is on there. It's a high.
And, you know, I feel like my friends who are professors seem on average happier than my friends who work at a frontier lab, to be totally honest. That's just grounding. And the frontier labs definitely do this nine nine six, which is essentially shorthand for work all the time.
Can you describe nine nine six, a culture that, I believe you could say, was invented in China and adopted in Silicon Valley? What's nine nine six? It's nine A_M_ to nine P_M_, six days a week. What is that, seventy-two hours? Okay. So is this basically the standard in A_I_ companies in Silicon Valley, more and more this kind of grind mindset?
Exactly like that. But I think there is a trend towards it. And it's interesting, I think it almost flipped, because when I was in academia, I felt like that, because as a professor you had to write grants, you had to teach, and you had to do your research. It's like three jobs in one, and it is more than a full-time job if you wanna be successful. And I feel like now, like Nathan just said, the professors, in comparison, maybe have even less pressure or workload than at a frontier lab, because
they work a lot, but they're just so fulfilled. Working with students, having a constant runway of mentorship and a mission that is very people-oriented, in an era when things are moving very fast and chaotically, is very rewarding to people.
have to make it, and it is really important that people put in the time. But it is really hard, because you have to deliver constantly. And I've been at a start-up; I had a good time, but I don't know if I could do it forever. It's an interesting pace, and it's exactly like we talked about in the beginning: these models are leapfrogging each other, and they are just constantly trying to take the next step compared to the competitors. It's just ruthless right now, I think.
I think this leapfrogging nature and having multiple players is actually an underrated driver of language modelling progress, where competition is so deeply ingrained in people, and these companies have intentionally created very strong cultures. Like, Anthropic is known to be culturally so deeply committed and organised. We hear so little from them, and everybody at Anthropic seems very aligned, and being at a culture that is super
tight and having this competitive dynamic, talk about a thing that's gonna make you work hard and create things that are better. So I think this comes at the cost of human capital: you can only do this for so long, and people are definitely burning out. I wrote a post on burnout; I've cycled in and out of this myself, especially trying to be a manager of full model training. It's a crazy job doing this. The book Apple in China by Patrick McGee, he talked about how hard the
Apple engineers worked to set up the supply chains in China, and he was like, they had marriage-saving programmes, and he told on a podcast here that people died from this level of working hard. So it's a perfect environment for creating progress at human expense, and a lot of the human expense is the nine nine six that we started this with: people really do grind.
I also read this book. I think they had a code word for if someone had to go home to spend time with their family to save the marriage. It's crazy. And colleagues understand, okay, this is like red alert, for this situation we have to let that person go home this weekend. But at the same time, I don't think they were forced to work. They were so passionate about the product, I guess, that you get into that mindset. And I had that sometimes as an academic, but also as an independent person I have that sometimes; I overwork, and it's unhealthy. I had
issues, I had neck issues, because I did not take the breaks that I maybe should have taken. But it's not because anyone forced me; it's because I wanted to work, because it's exciting stuff. Yeah.
I have this great fortune of having conversations with a wide variety of human beings, and from there I get to see all these bubbles and echo chambers across the world, and it's fascinating to see how we humans form them. And I think it's fair to say that Silicon Valley is a kind of echo chamber, a kind of silo and bubble. I think bubbles are actually really useful and effective. It's not necessarily a negative thing, 'cause it could be ultra-productive, it could be the
the Steve Jobs reality distortion field 'cause you just convince each other the breakthroughs are imminent and by convincing each other of that you make the breakthroughs imminent.
Mm-hmm.
Byrne Hobart wrote a book classifying bubbles. Essentially, one of them is financial bubbles, which is speculation, which is bad, and the other one is, I don't know the term, but effectively build-out bubbles, because they push people to build these things. And I do think A_I_ is in the second kind, but I worry about it transitioning to a financial bubble.
Yeah, but also in the space of ideas, that bubble, you are doing a reality distortion field, and that means you are deviating from reality. And if you go too far from reality, while also working, you know, nine nine six, you might miss some fundamental aspects of the human experience. And this is a common problem in Silicon Valley: it's a very specific geographic area; you might not understand the Midwest perspective,
the full experience of all the other different humans in the United States and across the world. And you speak a certain way to each other, you convince each other of a certain thing, and that can get you into real trouble. Whether A_I_ is a big success and becomes a powerful technology, or it's not, in either trajectory you can get yourself into trouble. So you have to consider all of that. Here you are, a young person trying to decide what you wanna do with your life.
The thing is, I don't even really understand this, but the S_F_ A_I_ memes have gotten to the point where "permanent underclass" was one of them, which was the idea that the last six months of twenty twenty five were the only time to build durable value in an A_I_ start-up or model; otherwise all the value will be captured by existing companies and you will therefore be poor. That's an example of the S_F_ thing that goes too far. I still think, for young people, that being able to tap into it, if you're
really passionate about wanting to have an impact in A_I_, being physically in S_F_ is the most likely place for you to do this, but it has trade-offs.
I think S_F_ is an incredible place, but there is a bit of a bubble. And if you go into that bubble, which is extremely valuable, also get out: read history books, read literature, visit other places in the world. Twitter and Substack are not the entire world.
I would say, one of the people I worked with is moving to S_F_, and I need to get 'em a copy of Season of the Witch, which is a history of S_F_ from like nineteen sixty to nineteen eighty-five. It goes through the hippie revolution, the gay community kind of taking over the city and that culture emerging, and then the H_I_V_ AIDS crisis and other things. That is so recent, and so much turmoil and hurt, but also love, in S_F_, and no one knows about this. The great
season of The Witch I recommend it. A bunch of my S_F_ friends were who do get out recommended it to me, and I think that it's just like living there like I lived there and I didn't appreciate this context and it's just like so recent.
Yeah. Okay, we talked a lot about a lot of things, certainly about what was exciting last year. But one of the things you guys mentioned as exciting this year is the scaling of text diffusion models, a different exploration of text diffusion. Can you talk about what that is and what possibility it holds, as a different kind of approach than the current L_L_M_s?
Yes, so we talked a lot about the transformer architecture, and the autoregressive transformer architecture specifically, like G_P_T_. That doesn't mean no one is working on anything else; people are always on the lookout for the next big thing, because it would be almost stupid not to. Sure, right now the transformer architecture is the thing, it works best and there's nothing else, but it's always a good idea not to put all your eggs in one basket, so people are developing alternatives to the
autoregressive transformer. One of them would be, for example, text diffusion models. Listeners may know diffusion models from image generation; Stable Diffusion popularised it, and there was a paper on generating images. Back then people used GANs, generative adversarial networks, and then there was this diffusion process where you iteratively denoise an image, which over time resulted in really good quality images. Stable Diffusion came from a company, other companies built their own diffusion models, and people are now like, okay, can we try
this also for text? It doesn't immediately make intuitive sense, because text is not something continuous like a pixel that we can differentiate; it's discrete, so how do we implement that denoising process? But it's kind of similar to the BERT models by Google. If you go back to the original transformer, there were the encoder and the decoder. The decoder is what we are using right now in G_P_T_ and so forth. The encoder is more like a parallel technique, where you have
multiple tokens that you fill in in parallel. G_P_T_ models are autoregressive: you complete the sentence one token at a time. In BERT models you have a text, say a sentence, that has gaps; you mask tokens out, and one iteration fills in these gaps. Text diffusion is kind of like that: you start with, let's say, some random text, and then you fill in the missing parts or refine them iteratively over multiple iterations. And the cool
thing here is that this can do multiple tokens at the same time, so it's kind of like the promise of being more efficient. The trade-off, of course, is how good the quality is. It might be faster, but now you have this dimension of the denoising process: the more steps you do, the better the text becomes. You can scale in different ways, and people try to see if this is maybe a valid alternative to the autoregressive model in terms of giving you
the same quality for less compute. Right now, I think there are papers that suggest that if you wanna get the same quality, you have to crank up the denoising steps, and then you end up spending the same compute you would spend on an autoregressive model. The other downside is that it's parallel, which sounds appealing, but some tasks are not parallel: reasoning tasks, or tool use, where you have to ask an interpreter to give you an intermediate result, and that is tricky with diffusion models. There are some hybrids, but
the main idea is: how can we parallelize it? So it's an interesting avenue. Right now there are mostly research models out there, like LLaDA and some others; I saw some deployed models by start-ups. There is no big diffusion model at scale yet, nothing at the Gemini or ChatGPT level. But there was an announcement by Google where they said they are launching Gemini Diffusion, and they put it into the context of their, I think, Nano two model.
They said that basically, for the same quality on most benchmarks, they can generate things much faster. So, you mentioned what's next: I don't think the text diffusion model is gonna replace autoregressive models, but it will be something maybe for quick, cheap, at-scale tasks. Maybe the free tier in the future will be something like that.
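To make the parallel denoising idea concrete, here is a toy sketch, not any real model's algorithm: start from an all-masked sequence, and at each step commit the most confident fraction of guesses in parallel, while the rest stay masked for the next refinement pass. The `toy_predict` function is a random stand-in for a trained denoiser.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def toy_predict(tokens):
    # Stand-in for a trained denoiser: for every masked position,
    # return a (token, confidence) guess. Here it is just random.
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_generate(length=8, steps=8):
    tokens = [MASK] * length          # start from pure "noise" (all masks)
    for _ in range(steps):
        guesses = toy_predict(tokens)
        if not guesses:
            break                     # nothing left to denoise
        # Commit the most confident half of the guesses in parallel;
        # the rest stay masked and get refined in the next iteration.
        k = max(1, len(guesses) // 2)
        best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

print(diffusion_generate())  # 8 tokens, several filled per step, none masked
```

The autoregressive analogue would be one forward pass per token; here each pass fills several positions, which is where the speed-up comes from.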
I think there's a couple of examples where I've heard it's actually started to be used. To paint a picture of why this is so much better: when G_P_T_ five is taking thirty minutes to respond, it's generating one token at a time, and this diffusion idea essentially generates all of the tokens in the completion in one batch, which is why it could be way faster. The start-ups I'm hearing about are code start-ups, where you have a code base and somebody who's effectively vibe coding says make
this change. A code diff is essentially a huge reply from the model, but it doesn't have to have that much external context, and you can get it really fast by using these diffusion models. So one example I've heard is using text diffusion to generate really long diffs, because doing it with an autoregressive model would take minutes, and that latency, for a user-facing product, causes a lot of churn; every second you lose a lot of users. So I think this is gonna grow and have some applications. But I
actually thought that different types of models were gonna be used for different things sooner than they have been, so my prediction was a kind of trade-off. I think the tool-use point is the one that's stopping them from being general purpose, because Claude Code and ChatGPT have to be interleaved with search: the autoregressive chain is interrupted by some external tool, and I don't know how to do that with the diffusion set-up.
So what's the future of tool use, this year and in the coming years? Do you think there's gonna be a lot of development there? How is that integrated into the entire stack?
Right now it's mostly on the proprietary L_L_M_ side, but I think we will see more of it in open-source tooling, and it is a huge unlock, because then you can really outsource certain tasks from memorization to actual tools: instead of having the L_L_M_ memorize what twenty-three plus five is, just use a calculator.
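As a sketch of that control flow, with a made-up tool-call format rather than any particular vendor's A_P_I_: the model either returns plain text, or emits a structured call that the harness executes and feeds back into the context.

```python
import json
import operator

# Hypothetical tool registry; real systems use richer schemas
# (function-calling APIs, MCP, etc.), but the loop is the same idea.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def calculator(op, a, b):
    return OPS[op](a, b)

TOOLS = {"calculator": calculator}

def run_turn(model_output):
    # If the model emitted a JSON tool call, execute it and return the
    # result (which would be appended to the context for the next turn);
    # otherwise the output is a plain-text answer and passes through.
    try:
        msg = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output
    result = TOOLS[msg["tool"]](**msg["args"])
    return f"tool result: {result}"

# Instead of memorizing twenty-three plus five, the model calls the tool:
print(run_turn('{"tool": "calculator", "args": {"op": "add", "a": 23, "b": 5}}'))
# → tool result: 28
print(run_turn("The capital of France is Paris."))
```

The harness, not the model, does the arithmetic; the model only has to learn when a call is warranted.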
So do you think that can help solve hallucination?
Not solve it, but reduce it. The L_L_M_ needs to know when to ask for a tool call. And the second thing is, the internet is not always correct. You can do a web search, but say I ask who won the World Cup in nineteen ninety-eight: the model still needs to find the right website and get the right information. It can still go to an incorrect website and give me incorrect information. So I don't think it will fully solve hallucination, but it is improving it.
Another cool paper earlier this year, I think it was December thirty-first, so not technically twenty twenty-six but close, was the recursive language model paper. That's a cool idea that takes this even a bit further. Just to explain: Nathan, you also mentioned earlier that it's harder to do cool research in academia because of the compute budget; if I recall correctly, they did everything with G_P_T_ five, so they didn't even use local models. But the idea is, let's say you have a long-context task:
instead of having the L_L_M_ solve all of it in one shot, or even in a chain, you break it down into sub-tasks. You have the L_L_M_ decide what a good sub-task is and then recursively call an L_L_M_ to solve it, and maybe also add tools. Say you have a huge Q_ and A_ task: each sub-call goes to the web and gathers information, and at the end you pull it together and stitch it back together. I think there's gonna be a lot
of unlock in things like that, where you don't necessarily improve the L_L_M_ itself; you improve how the L_L_M_ is used and what the L_L_M_ can use. One downside right now with tool use is that you have to give the L_L_M_ permission to use tools, and that will take some trust, especially if you wanna unlock things like having an L_L_M_ answer emails for you. Or not even answer, but just sort them or select them for you. I don't know if I would give an L_L_M_ access to my emails today; it seems
like a huge risk.
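A toy of the recursive idea just described; the `llm` function here is a stub (in the paper these would be actual model calls), and the splitting and aggregation are deliberately crude.

```python
def llm(prompt):
    # Stub "model call": answers a counting question over its own chunk.
    # In the recursive-LM set-up this would be a real L_L_M_ call.
    return str(prompt.count("fail"))

def recursive_answer(question, context, chunk_words=40):
    # Split the long context into sub-tasks, answer each with its own
    # call, then aggregate. Splitting on words keeps tokens intact.
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [llm(f"{question}\n---\n{c}") for c in chunks]
    # Aggregation step: a simple sum here; in general, one more call.
    return sum(int(p) for p in partials)

log = ("ok " * 30 + "fail " + "ok " * 40) * 3   # 3 failures buried in noise
print(recursive_answer("Count the bad lines.", log))  # → 3
```

No single call ever sees the full log; each sub-call works on a small window, which is the memory saving the paper is after.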
I think there's one cool last point on the tool-use thing. You hinted at this, and we've both come at it in our own ways: open versus closed models use tools in very different ways. With open models, people go to Hugging Face and download the model, and then each person decides what tool they want; I don't know, Exa is my preferred search provider, but somebody else might prefer a different search start-up. When you release a model, it needs to be useful for multiple tools and multiple use cases, which is really hard, because you're making a general
engine model, which is actually what G_P_T_-O_S_S_ is good for. With the closed models, you're deeply integrating the specific tool into your experience. And I think that open models will struggle to replicate some of the things I like to do with closed models, like referencing a mix of public and private information. Something I keep trying every three to six months is Codex on the web, which is just prompting a model to make an update to some GitHub repository that I have. And
that sort of secure cloud environment is just so nice: send it off to do this thing and then come back to me. These will probably help define some of the local, open, and closed niches, but initially, because there was such a rush to get tool use working, the open models were on the back foot. Which is kind of inevitable, given how many resources these frontier labs have. But it will be fun when the open models solve this, because it's gonna be
a bit more flexible and a potentially more interesting model that might work with this recursive idea, acting as an orchestrator and a tool-use model. So hopefully the necessity drives some interesting innovation there.
So, continual learning. This is a long-standing, important problem, and I think it increases in importance as the cost of training the models goes up. Can you explain what continual learning is and how important it might be, this year and in the coming years, to making progress?
This relates a lot to the S_F_ zeitgeist of what A_G_I_, artificial general intelligence, is, what A_S_I_, artificial super-intelligence, is, and what the language models we have today are capable of doing. I think language models can solve a lot of tasks, but a key milestone among the A_I_ community is essentially when A_I_ could replace any remote worker: taking in information, solving digital tasks, and doing them. The limitation highlighted by people is that
a language model will not learn from feedback the way an employee does. If you hire an editor, the editor will mess up, but you will tell them, and if you hired a good editor, they don't do it again. Language models don't have this ability to modify themselves and learn very quickly. So the idea is: if we're gonna actually get to a true, general, adaptable intelligence that can go into any remote-work scenario, it needs to be able to learn quickly from feedback, on-the-job learning.
Mm-hmm.
I'm personally more bullish on language models being able to do this if you just provide them with very good context. Like you maybe said off-line, you can write extensive documents for models where you say: I have all this information, here are all the blog posts I've ever written, I like this type of writing, my voice is based on this. But a lot of people don't provide this to models, and the models weren't previously designed to take this amount of context; the agentic models are just starting to. So it's this trade-off of: do we need to update
the weights of this model with this continual-learning thing to make it learn fast? Or, the counter-argument: we just need to provide more context and information, and the models will have the appearance of learning fast by having a lot of context and being very smart.
We should mention the terminology here. Continual learning refers to changing the weights continuously so that the model adapts and adjusts based on new incoming information, and does so continually, rapidly, and frequently. And the thing you mention on the other side would generally be referred to as in-context learning: as you learn stuff, there's a huge context window, and you can just keep loading
in extra information every time you prompt the system. Both can legitimately be seen as learning; it's just a different place where you're doing the learning.
In terms of updating weights, we already have that in different flavours. I think the distinction here is: do you do it on a personalised, custom model for each person, or on a global model? And we have the latter already, going from G_P_T_ five to five point one and five point two. It's maybe not immediate, but it is a curated update, a quick curated update: there was feedback from the community about things it couldn't do, they updated the weights, next model, and
so forth. So it is kind of a flavour of that. An even finer-grained example is R_L_V_R_: you run it, it updates. The problem is you can't just do that for each person, because it would be too expensive to update the weights for everyone, and I think that's the real problem. Even at OpenAI scale, building big data centres, it would be too expensive. It only becomes feasible once you have something on-device, where the cost is on the user,
like what Apple tried to do with the Apple Foundation models: putting them on the phone and then having them learn from experience.
A related topic, and maybe an anthropomorphizing term, but: memory. What are the different ideas for mechanisms to add memory to these systems, personalised memory especially?
Right now it's mostly context, basically: stuffing things into the context and then recalling them. But it's expensive, because, I mean, you can cache it, but you still spend tokens on it. And you can only do so much; it's more for a preference or a style. A lot of people do that when they solve math problems: there are ways you can add previous knowledge, and you also give it certain preferences in the
prompt, like do what I preferred last time, something like that. But it doesn't unlock new capabilities. For that, one thing people do still use is LoRA adapters. Instead of updating the whole weight matrix, these are two smaller weight matrices that you overlay in parallel, like a delta. You can do that to some extent, but then again, it's economics. There was also a paper,
"LoRA learns less and forgets less." There's no free lunch: if you wanna learn more, you need to use more weights, but that gets more expensive; and if you learn more, you forget more. You have to find that Goldilocks zone, basically.
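The adapter idea can be shown with a tiny pure-Python sketch; real implementations also scale the delta and train A and B with gradients, which is omitted here. The frozen weight W gets a low-rank delta A·B overlaid on it, so only a fraction of the parameters are trained.

```python
import random

random.seed(0)

def matmul(A, B):
    # Plain-Python matrix multiply, enough for this toy.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d, r = 4, 1   # model dimension vs. LoRA rank (r << d in practice)

# Frozen pretrained weight W: fine-tuning it directly trains d*d numbers.
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# LoRA trains only A (d x r) and B (r x d): 2*d*r numbers instead.
# B starts at zero, so the delta, and the behaviour change, starts at zero.
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]
B = [[0.0] * d for _ in range(r)]

delta = matmul(A, B)                  # the low-rank "overlay"
W_adapted = [[w + dw for w, dw in zip(wr, dr)]
             for wr, dr in zip(W, delta)]

print(W_adapted == W)                 # True: delta is zero before training
print(f"full fine-tune params: {d * d}, LoRA params: {2 * d * r}")
```

At a realistic scale, say d in the thousands and r of eight or sixteen, the parameter saving is what makes per-user adapters economically plausible.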
We haven't really mentioned it much, but implied in this discussion is context length. Is there a lot of innovation possible there?
I think the colloquially accepted answer is that it's a compute and data problem, plus sometimes small architecture things, like attention variants. We talked about hybrid attention models, which essentially have what looks like a state-space model inside the transformer, and those are better suited because you spend less compute to model the furthest-along token. But those aren't free, because they have to be
backed by a lot of compute or the right data. How many sequences of a hundred thousand tokens do you have in the world, and where do you get them? It just ends up being pretty expensive to scale. We've gotten pretty quickly to about a million tokens of input context length, and I would expect it to keep increasing, to maybe two million or five million this year. But I don't expect it to go to a hundred million; that would be a true breakthrough. And I think those breakthroughs are possible, like the continual learning thing: I
think of it as a research problem where there could be a breakthrough that just makes transformers work way better at this, cheaply. These things could happen with so much scientific attention on them, but by turning the crank it'll be consistent increases over time.
I think, also looking at the extremes, there's again no free lunch. At one extreme, to make it cheap, you have, let's say, an R_N_N_ that has a single state where you save everything from the previous steps; it's a specific, fixed-size thing, so the memory never really grows, because you are stuffing everything into one state. But the longer the context gets, the more information you forget, because you can't compress everything into one state. At the other end you have the transformers, which try to remember every token,
which is great when we want to look up specific information, but very expensive, because the K_V_ cache grows and the dot products grow. Like you said, the Mamba layers kind of have the same problem as an R_N_N_: you try to compress everything into one state. You're a bit more selective there, but it's this Goldilocks zone again. With Nemotron they found a good ratio of how many attention layers you need for the global information, where everything is accessible, compared to these compressed states.
I think that's how we will scale more: by finding better ratios in that Goldilocks zone, between making it cheap enough to run and powerful enough to be useful. And one more plug here: the recursive language model paper is one of the papers that tries to address the long-context thing. What they found is essentially that instead of stuffing everything into one long context, if you break it up into
multiple smaller tasks, you save memory by having multiple smaller calls, and you can actually get better accuracy than having the L_L_M_ try everything all at once. It's a new paradigm; we will see, there might be other flavours of it. So I think we will still make improvements on long context, but also, like Nathan said, the problem is that for pre-training itself we don't have as many long-context documents as other documents, so it's harder to study
how L_L_M_s behave at that level.
There are some rules of thumb: essentially you pre-train a language model at, say, eight K_ context length and then extend it to thirty-two K_ with training, and doubling the training context length takes roughly two X_ compute, and then you can normally two-to-four X_ the context length again. So a lot of it ends up being compute-bound at pre-training, which links to what we talked about: everyone talks about this big increase in compute for the top labs this year, and that should be reflected in longer
context lengths. But on the post-training side there are more interesting things, which is that as we have agents, the agents are gonna manage this context on their own. People who use Claude Code a lot dread the compaction, which is when Claude takes its entire hundred thousand tokens of work and compacts it into a bulleted list. But what the next models will do, and this is not novel, I'm sure people are already working on it, is essentially let the model control when it compacts and how. You can train your R_L_ algorithm where compaction is an action that shortens
the history, and the problem formulation is: I want to keep the maximum evaluation scores I've gotten while the model compacts its history to the minimum length, because then you have the minimum number of tokens needed for this kind of compounding autoregressive prediction. So there are actually some pretty nice problem set-ups here, where these agentic models learn to use their context in a different way than just plowing forward.
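A minimal sketch of compaction as an explicit action in the agent loop; the summary below is a crude keep-first-and-recent truncation standing in for a model-written summary, and an R_L_ objective would reward keeping task scores high while minimizing the retained tokens.

```python
BUDGET = 12                       # max history entries to carry forward

def compact(history):
    # Stand-in summary: keep the opening entry plus the most recent work.
    return [history[0], "<summary>"] + history[-3:]

def step(history, new_entries):
    history = history + new_entries
    if len(history) > BUDGET:     # the "compaction action" fires
        history = compact(history)
    return history

h = []
for turn in range(5):             # each turn appends 4 entries of "work"
    h = step(h, [f"t{turn}.{i}" for i in range(4)])
print(len(h), h)                  # stays small instead of growing to 20
```

The training question is then when to fire the action and what the summary should keep, which is where the learned policy replaces this fixed rule.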
One interesting recent example would be DeepSeek version three point two, where they had the sparse attention mechanism: a very efficient, small, lightweight indexer, and instead of attending to all the tokens, it selects which tokens it actually needs. It almost comes back to the original idea of attention, where you are selective; but with regular attention you always use all the tokens, maybe with near-zero weight on some of them, whereas here it's: let's just mask that out, or not even compute it. And
sliding-window attention is also kind of like that idea. You have a rolling window that you keep fixed, because you don't need everything all the time. Occasionally, in some layers, you might, but it's wasteful. Right now, if you use everything, you're on the safe side; it gives you the best quality because you never miss information. And I think this year will also be the year of figuring out, like you said, how to be smarter about that. Right now people wanna have the next state of the art, and the state of the art happens to be the
brute-force expensive thing. And once you have that, like you said, you keep that accuracy but see how you can do it cheaper, with tricks.
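A toy mask for the sliding-window idea mentioned above: each query attends only to itself and the previous tokens inside the window, so the number of scored pairs grows with n times the window size instead of n squared.

```python
def sliding_window_mask(seq_len, window):
    # mask[i][j] == 1 means query i may attend to key j: causal, and
    # restricted to a rolling window of the most recent tokens.
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print(row)

full = sum(i + 1 for i in range(6))          # full causal: 21 pairs scored
windowed = sum(sum(row) for row in mask)     # windowed: 15 pairs
print(full, windowed)
```

The gap between the two counts widens quickly with sequence length, which is why mixing a few full-attention layers with many windowed ones is the ratio game described above.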
Yeah, it's all this scaling thing. The reason we get the Claude four point five Sonnet model first is because you can train it faster; you're not hitting the compute walls as soon, so they can try a lot more things and ship the model faster, even though the bigger model is actually better.
I think we should say that there's a lot of exciting stuff going on in the A_I_ space. My mind has recently been really focused on robotics, and today we almost entirely didn't talk about robotics. There's also a lot of stuff in image generation and video generation. I think it's fair to say that the most exciting research work, in terms of sheer intensity and fervor, is in the L_L_M_ space, which is why I think it's justified for us to really focus on the
L_L_M_ topics we're discussing. But it'd be nice to bring in some things that might be useful. For example, world models; there's growing excitement about those. Do you think there'll be any use for world models in the L_L_M_ space this coming year?
Yes, I do think so. An interesting thing with L_L_M_s is that if we unlock more L_L_M_ capabilities, it also automatically makes progress faster in all the other fields, because a lot of researchers and engineers use L_L_M_s, as we said, for coding. So even if they work on robotics, if you optimize the L_L_M_s that help with coding, it pays off. But then, yes, world models are interesting. It's basically where you have the model
run a simulation of the world, a little toy version of the real thing, which can again unlock capabilities the L_L_M_ is not aware of: it can simulate things. L_L_M_s just happen to work well through pre-training and next-token prediction, but we could make this a bit more sophisticated. So what I'm saying is, there's a
paper, Code World Models, where they basically apply the concept of world models to L_L_M_s: instead of just having next-token prediction and verifiable rewards checking the answer's correctness, they also make sure the intermediate variables are correct. It's kind of like the model is learning a code environment, in a sense. I think this makes a lot of sense; it's just expensive to do, but it makes things more sophisticated,
modelling the whole thing, not just the result, so it can add more value. I remember when I was a grad student there was a competition called CASP, I think, where they do protein structure prediction: predicting the structure of a protein that is not solved yet at that point. In that sense it's actually great, and I think we need something like that for L_L_M_s,
where you do the benchmark but hand in results without knowing the solution, and only after the fact someone reveals it. AlphaFold, when it came out, crushed this benchmark. There were multiple iterations, but I remember that the first one, and I'm not an expert in that subfield, explicitly modelled the physical interactions, the physics of the molecule, like which angles are impossible. And then in the
next version, I think, they got rid of this and just brute-force scaled it up. With L_L_M_s we are currently in this brute-force scaling phase because it just happens to work. But I do think at some point it might make sense to bring this kind of modelling back, and with world models I think that might actually be quite cool. And of course also for robotics, which is a completely different area from L_L_M_s.
Yeah, and in robotics it's very explicit. There's the problem of locomotion and the problem of manipulation. Locomotion is much more solved, especially in the learning domain. But there's a lot of value, just like with the initial protein-folding systems, in bringing in traditional model-based methods. It's unlikely that you can just learn the manipulation, or the whole-body loco-manipulation problem, end to end. That's the dream, but then when you look at the magic of the human hand
and the complexity of the real world, you realise it's really hard to learn this all the way through, in a way that, I guess, AlphaFold two didn't.
I'm excited about the robot-learning space, though. I think it's collectively getting supercharged by all the excitement and investment in language models generally. The infrastructure for training transformers, which is a general modelling thing, is becoming world-class industrial tooling, so wherever that was a limitation for robotics, it's just way better now, whether in tooling or compute. And on top of that, they take these language models and use them as central units where you can do interesting exploration
around something that already kind of works. I see it emerging kind of like what we talked about with Hugging Face Transformers. When I was at Hugging Face I was trying to get this to happen, but it was too early: open robotics models on Hugging Face, with people able to contribute data and fine-tune them. I think we're much closer now with the investment in robotics, and I think self-driving cars are related and enable this: once you get to the point where you have this sort of ecosystem where somebody can download a
robotics model and maybe fine-tune it to their robot, or share data sets across the world. There's some work in this area, like R_T_-X_, I think from a few years ago, where people were trying to do that. But once they have this ecosystem, it'll look very different, and this whole post-ChatGPT boom is putting more resources into it, which I think makes it a very good area for doing research.
This is also resulting in much better, more accurate, more realistic simulators being built, closing the sim-to-real gap in the robotics space. But you mentioned a lot of excitement and a lot of investment in the robotics space. The downside of that, which happens in hype cycles: I personally believe, and most robotics people believe, that robotics is not going to be solved on the time scale that is being implicitly or explicitly
promised. So what happens when all these robotics companies spring up and then don't have a product that works? Then there's going to be this crash of excitement, which is nerve-racking. Hopefully something else will come swooping in so that the continued development of some of these ideas keeps going.
It's also related to the continual learning issue, essentially, because the real world is so complex. With L_L_M_s you don't really need something learned per user, because there are a lot of things everyone has to do; everyone maybe wants to, I don't know, fix the grammar in their email, or code, or something like that. It's more constrained, so you can prepare the model for it. But preparing the robot for the real world is harder. I mean, you have the robotic foundation models, and
you can learn certain things, like grasping, but everyone's house is different. That is where the robot would have to learn on the job, essentially, and that, I guess, is the bottleneck right now: how to customise it on the fly.
I don't think I can possibly overstate the importance of the thing that almost never gets talked about by robotics folks, or anyone: safety. All the interesting complexities we talk about in learning, all the failure modes and failure cases, everything we've been discussing with L_L_M_s sometimes failing in interesting ways, all of that is fun and games in the L_L_M_ space. In the robotics space, in people's homes, across millions of minutes,
billions of interactions, you are almost never allowed to fail. When you have embodied systems out in the real world, you have to solve so many problems you never thought you'd have to solve when you were just thinking about the general robot-learning problem.
And so I'm bearish on in-home learned robots for consumer purchase. I'm very bullish on self-driving cars, and I'm very bullish on robotic automation, e.g. Amazon distribution, where Amazon has built whole new distribution centres designed for robots first rather than humans.
Mm-hmm.
The path to robots doing that is more reasonable, where it's a thing that is designed and optimised to do a repetitive task that a human could conceivably do but doesn't want to. But it's also gonna take a lot longer than people probably predict. I think the leap from A_I_ singularity to "we can now scale up mass manufacturing in the U_S_ because we have a massive A_I_ advantage" is one that is
troubled by a lot of political and other challenging problems.
Let's talk about timelines. Uh specifically timelines to A_G_I_ or A_S_I_.
Is it fair like as a starting point to say that nobody really agrees on the definitions of A_G_I_ and A_S_I_?
I kind of think there's a lot of disagreement, but I've been getting pushback where a lot of people kind of say the same thing, which is: a thing that could reproduce most digital economic work. So the remote worker is a fairly reasonable example, and I think open A_I_'s definition is somewhat related to that, which is an A_I_ that can do a certain number of economically valuable tasks, which I don't really love as a definition, but I think it could be a grounding point, because
language models today, while immensely powerful, are not this drop-in remote worker, and there are things you could think of that could be done by an A_I_ that are way harder than remote work, like finding an unexpected scientific discovery that you couldn't even posit, which would be an example of something somebody would call an artificial super-intelligence problem, or taking in all medical records and finding linkages across certain illnesses that we
didn't know about, or figuring out that some common drug can treat some niche cancer. They would say that that is a super-intelligence thing. So these are kind of natural tiers. My problem with it is that it becomes deeply entwined with the quest for meaning of A_I_ and these religious aspects to it. So there's different paths you can take it.
And I don't even know if the remote worker is a good definition, 'cause what exactly is that? It's like perfect tool use. I actually, I mean, I don't know if you like the A_I_ twenty twenty-seven report. They focus more on code and research taste. So the target there is the superhuman coder. They have several milestone systems: superhuman coder, superhuman A_I_ researcher, then super-intelligent A_I_ researcher, and the full
artificial super-intelligence. But after you develop the superhuman coder, everything else falls quickly. There the task is to fully automate coding. So any kind of coding you need to do in order to perform research is fully automated. And from there, humans would be doing A_I_ research together with that system, and they will quickly be able to develop a system that
can actually do the research for you. That's the idea. And initially their prediction was twenty twenty-seven, twenty twenty-eight, and now they've pushed it back by three to four years, to a twenty thirty-one mean prediction. My prediction is probably even beyond twenty thirty-one, but at least you can, in a concrete way, think about how difficult it is to fully automate programming.
Yeah, I disagree with some of their presumptions and dynamics on how it would play out. But I think they did good work in the scenario, defining milestones that are concrete, and telling a useful story, which is why the reach of this A_I_ twenty twenty-seven document well transcended Silicon Valley: they told a good story and they did a lot of rigorous work. The camp I fall into is that A_I_ is so-called jagged, where it will be excellent at some things and really bad at others. So I think
when they're close to this automated software engineer, what it will be good at is traditional M_L_ systems and front-end, which the models are excellent at, but distributed M_L_ the models are actually really quite bad at, 'cause there's so little training data on doing large-scale distributed learning. And this is something we already see, and I think those gaps just get amplified. And then it's kind of messier in these trade-offs, and then there's how you think A_I_ research works, and so on.
So you think, basically, the superhuman coder is almost unachievable? Meaning, because of the jagged nature of the thing, you're just always going to have gaps in capabilities.
I think it's assigning completeness to something where the models are kind of superhuman at some types of code, and I think that will continue. And people are creative, so they'll utilise these incredible abilities to fill in the weaknesses of the models and move really fast. And it'll always kind of be, as I've perceived for a long time, this dance where the humans are enabling the thing the model can't do, and the best A_I_ researchers are the ones that can enable this superpower. And I think this aligns with what we already see. Like Claude Code for building:
you can stand up a beautiful website in a few hours or do data analysis. And it's gonna keep getting better at these things, and it'll pick up some new code skills along the way. And kind of linking to what's happening in big tech: this A_I_ twenty twenty-seven report leans into the singularity idea, where I think research is messy and social and largely in the data in ways that A_I_ models can't process. But what we do have today is really powerful, and
tech companies are all collectively buying into this with tens of billions of dollars of investment. So we are gonna get some much better version of chat G_P_T_, a much better version of Claude Code than we already have. I think it's just hard to predict where that is going, but the bright clarity of that future is why some of the most powerful people in the world are putting so much money into this. And I think it's just kind of small differences: we don't actually know what a better version of chat G_P_T_ is, but also,
can it automate A_I_ research? I would say probably not, at least in this time frame. Big tech is gonna spend a hundred billion dollars much faster than we get an automated A_I_ researcher that enables an A_I_ research singularity.
So your prediction would be what? Like, if this is even a useful milestone, we're more than ten years out?
I would say less than that on the software side, but I think longer than that on things like research.
Well, let's just, for fun, try to imagine a world where all software writing is fully automated. Can you imagine that world?
By the end of this year, the amount of software that'll be automated will be so high. But it'll be things like: you're trying to train a model with R_L_ and you need to have multiple bunches of G_P_U_s communicating with each other; that'll still be hard, but I think it'll be much easier.
One of the ways to think about this, the full automation of programming, is to think of the lines of useful code written, and the fraction of that to the number of humans in the loop. So presumably, for a long time there'll be humans in the loop of software writing; there will just be fewer and fewer relative to the amount of code written, right. And for the S_C_, the superhuman coder, I think the presumption there is that the number of humans in the loop goes to zero. What does that
world look like when the number of humans in the loop is in the hundreds, not in the hundreds of thousands?
I think software engineering will be driven more to system design and goals of outcomes. I do think this has been happening over the last few weeks, where people have gone from, a month ago, "oh yeah, agents are kind of slop," which is a famous Karpathy quote, to what is a little bit of a meme, the industrialization of software, when anyone can just create software at their fingertips. I do think we are closer to that side of things, and it takes direction
and understanding how the systems work to extract the best from the language models, and I think it's hard to accept the gravity of how much software development is gonna change, and how many more people can do things without ever looking at the code.
What's interesting is to think about whether these systems will be completely independent, in the sense that, well, I have no doubt that L_L_M_s will at some point solve coding in the sense that calculators solved calculating, right. At some point humans developed the tool where you never need a human to calculate that number: you just type it in, it's an algorithm. And I think that's probably the same for coding. But the question is, I think what will happen is, you will just say build
that website, it will make a really good website, and then you maybe refine it. But will it do things independently? Will you still have humans asking the A_I_ to do something? Like, will there be a person saying, build that website? Or will there be A_I_ that just builds websites, or something? Or whatever.
I think with building websites, the problem with websites and the problem with the web, you know, H_T_M_L_ and all that kind of stuff, is that it's very resilient to slop. It'll render slop just as readily as anything else. I would rather think of safety-critical systems, like asking A_I_ to end-to-end generate something that manages logistics, or manages cars and fleets of
cars, all that kind of stuff. End-to-end generate stuff for you.
I think a more intermediate example is to take something like Slack or Microsoft Word. If the organisations allow it, A_I_ could very easily implement features end to end and do a fairly good job for things that you want to try. You wanna add a new tab in Slack that you want to use, and I think A_I_ will be able to do that pretty well.
Actually that's a really great example. How far away are we from that?
Like this year.
See, I don't know. I don't know.
how bad production code bases are, but I think that within, on the order of low years, a lot of people are gonna be pushed to be more of a designer and product manager, where you have multiple of these agents that can try things for you, and they might take one to two days to implement a feature or attempt to fix a bug, and you have these dashboards. I think Slack is actually a good dashboard, where your agents will talk to you and you'll then give feedback. But things like, make me a website: you wanna make a logo.
I think these cohesive design things and this style are gonna be very hard for models, and deciding what to add next.
Okay, so I hang out with a lot of programmers, and some of them are a little bit on the skeptical side in general. Just vibe-wise, they're like that.
I just think there's a lot of complexity involved in adding features to complex systems. Like if you look at the browser, Chrome,
Mm-hmm.
if I wanted to add a feature, if I wanted to have tabs on the left side as opposed to up top, in the interface, right. I think that's not a next-year thing.
One of the Claude releases this year, one of their tests was: we give it a piece of software and leave Claude to run to re-create it entirely. And it could already almost rebuild something like Slack from scratch, just given the parameters of the software and left in a sandbox environment to do that.
Mm-hmm. So it might be that the smaller, newer companies are advantaged. They're like: we don't have to have the bloat and complexity, and therefore this future exists.
It's a specification issue. So with programming, you're just assuming, this is like a communication issue, as in relationships and friendships. You're assuming the L_L_M_ somehow is supposed to read your mind. I think this is where spec-driven design is really important. You just, using natural language, specify what you want.
I think, if you've talked to people at the labs, they use these in their training and production code. Claude Code is built with Claude Code, and they all use these things extensively, and Dario talks about how much of Claude's code is written by Claude. And these people are slightly ahead in terms of the capabilities they have, and on inference they could spend ten to a hundred-plus X_ as much as we're spending. We're on a lowly hundred- or two-hundred-dollar-a-month plan. They truly let it rip. And I think that,
with the pace of progress that we have, where a year ago we didn't have Claude Code and we didn't really have reasoning models, it's like the difference between sitting here today and what we can do with these models, and it seems like there's a lot of low-hanging fruit to improve them. The failure modes are pretty dumb. It's like: Claude, you tried to use a C_L_I_ command that isn't installed fourteen times, and then I sent you the command to run. That thing, from a modelling perspective, is pretty
fixable. So, uh
I agree with you. I've been becoming more and more bullish in general. Speaking to what you're articulating, I think it is a human skill issue. So Anthropic, or other companies, are leading the way in understanding how to best use the models for programming, and therefore they're effectively using them. I think there are a lot of programmers on the outskirts; there's not a really good guide on how to use them. People are trying to figure it out.
It might be very expensive. It might be that the entry point for that is two thousand dollars a month, which is only tech companies and rich people. That could just be it.
But it might be worth it. I mean, if the final result is a working software system, well, it might be worth it. By the way, it's funny how we converged from the discussion of timelines to A_G_I_ to something more pragmatic and useful. Is there anything concrete and interesting and useful and profound to be said about timelines to A_G_I_ and A_S_I_? Or are these discussions a bit too detached from the day-to-day?
There are interesting bets. So there are a lot of people trying to do reinforcement learning with verifiable rewards, but in real scientific domains, where there are start-ups that have hundreds of millions of dollars of funding and have wet labs where they're having language models propose hypotheses that are tested in the real world. And I would say they're early, but with the pace of progress, maybe they're early by six months and they make it because they were there first, or maybe they're early by eight years, so you don't really know. So I think that type of moonshot,
to branch this momentum into other sciences, is like, okay, that would be very transformative if alpha-fold moments happen in all sorts of other scientific domains by a startup solving this. I think there are startups, maybe Harmonic is one, where they're going all in on language models plus Lean for math. I think you had another podcast guest where you talked about this recently, and it's like, we don't know exactly how it's gonna fall out of spending a
million dollars on that model. And most of them will fail, but a couple of them might be big breakthroughs that are very different than chat G_P_T_ or Claude Code type software experiences. Like a tool that's only good for a P_H_D_ mathematician, but makes them a hundred X_ more effective.
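The "language models plus Lean" bet is concrete enough to sketch: the model proposes a proof term or tactic script, and Lean's kernel mechanically accepts or rejects it, which is exactly a verifiable reward. A toy illustration in Lean 4 (the theorem and proof here are our own example, not taken from Harmonic or any actual system):

```lean
-- A candidate proof a language model might emit. If the term fails to
-- type-check, Lean rejects it (reward 0); if it elaborates, the proof
-- is certified correct (reward 1), with no human grading needed.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The design point is that the checker, not a learned reward model, supplies the training signal, which is why math is a natural fit for this style of R_L_.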
Okay, I agree. I think this will happen in a lot of domains, especially domains that have a lot of resources, like finance and legal and pharmaceutical companies. But then again, is it really A_G_I_, because we are now specialising it again? And is it really that much different from back in the day, when we had specialised algorithms? I think it's the same thing, just way more sophisticated, but I dunno, is there a threshold when we call it A_G_I_? I guess I think
the real cool thing here is that we have the foundation models that we can specialise. I think that's the breakthrough at some point. Right now, we are not there yet, because, well, first it's too expensive, but also, you know, OpenAI doesn't just give away the model for you to customise it. I think once that's true in some way, and I can imagine this as a business model, OpenAI may say at some point: hey, you know, Bank of America, for a hundred million we will do your custom model, or something like that. And I think that will be a
huge economic value-add. The other thing, though, is also companies. I mean, right now, what is the differentiating factor? If everyone uses the same L_L_M_, if everyone uses G_P_T_, they will all do the same thing again. Then everyone is moving in lockstep, but usually companies want to have a competitive advantage, and I think there's no way around using some of their private data and experimenting and maybe specialising. It's gonna be interesting, yeah.
Sitting in the pace of progress, it does just feel like things are coming. I don't think the A_G_I_ and A_S_I_ thresholds are particularly useful.
I think, I guess, the real question, and this takes us to the remote worker thing, is: when are we going to see a big, obvious leap in economic impact? 'Cause currently there's not been an obvious leap in the economic impact of L_L_M_s, for example. And, you know, aside from A_G_I_ or A_S_I_ or all that kind of stuff, there's a real question of when are we gonna see a G_D_P_ bump.
Mm-hmm mm-hmm.
Yeah, it's like, what is the G_D_P_ made up of? A lot of it is financial services, so I don't know what this is. It's just hard for me to think about the G_D_P_ bump. But I'd say that software development becomes valuable in a different way when you no longer have to look at the code anymore. So when it's like, Claude'll make you a small business, which is essentially Claude can set up your website, your bank account, your email, and whatever else, and you just have to express
what you're trying to put into the world. That's not just an enterprise market, but it is hard; I don't know how you get people to try doing that. I guess if chat G_P_T_ can do it, people are trying chat G_P_T_.
I think it boils down to the scientific question of how hard tool use is to solve.
A lot of the stuff you're implying, the remote work stuff, is tool use. It's computer use: how you have an L_L_M_ that goes out there, this agentic system, and does something in the world and only screws up one percent of the time.
Computer use is a good example of something labs care about and we haven't seen a lot of progress on. We saw multiple demos in twenty twenty-five, like Claude can use your computer, or OpenAI had C_U_A_, and they all suck. So they're also investing money in this and they think that'll be a good example. That's actually something where it just seems like a hard environment
for the model to work in. They're not working on your MacBook. They are individually interfacing with Google and Amazon and Slack, and they handle all these things in a very different way than humans do. So some of those might be structural blockers.
Also, specification-wise, I think the problem is, for arbitrary tasks, well, you still have to specify what you want your L_L_M_ to do, and how do you do that? What is the environment? You can say what the end goal is, but what if it can't solve the end goal? With L_L_M_s, if you ask for text, you can always clarify, do sub-steps. But how do you put that information into a system that, say, books a travel trip for you? You can say, well, you screwed up my credit card information, but
even to get it to that point, how do you, as a user, guide the model before it can even attempt that? I think the interface is really hard.
Yeah, it has to learn a lot about you specifically. This goes back to continual learning: learning about the general mistakes that are made throughout, and then the mistakes that are made by you.
Yeah.
Mm-hmm.
Mm-hmm.
Engagement. Some people really like this Pulse feature, which processes your chats and automatically searches for information and puts it in the chat G_P_T_ app. So there's a lot of things coming for that.
I used that feature before, and I always feel bad because it does that every day and I rarely check it out. It's like, how much compute is burned on something I don't even look at, you know. It's kind of like, yeah, sure, okay.
New ideas might be needed. Is it possible that the path to A_G_I_, whatever that is, however defined, to solve computer use more generally, to solve biology and chemistry and physics, sort of the Dario definition of A_G_I_ or powerful A_I_, do you think it's possible that totally new ideas are needed? Non-L_L_M_, non-R_L_ ideas. What might they look like? Now we're going
into philosophy land a little bit.
For something like a singularity to happen, I would say yes. And the new ideas could be architectures or training algorithms, which are fundamental deep learning things. But those are by nature pretty hard to predict, and I think we will get very far even without those advances. We might get this software solution, but it might stop at software and not do computer use without more innovation. So a lot of progress will be coming, but if you zoom out, there are
still ideas in the next thirty years that are gonna look like: that was a major scientific innovation that enabled the next chapter of this. And I don't know if it comes in one year or in fifteen years.
Yeah, well I wonder if the bitter lesson holds true for the next hundred years, what that looks like.
If scaling laws are fundamental in deep learning, I think the bitter lesson will always apply, which is that compute will become more abundant. But even with abundant compute, the ones that have a steeper scaling-law slope or a better offset, this is a two-D_ plot of performance versus compute, even if there's more compute available, the ones that get a hundred X_ out of it will win.
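That slope-and-offset picture is just the usual power-law fit, loss(C) ≈ a·C^(−b) + floor. A quick sketch with invented coefficients (not from any real model family) shows why extra compute alone doesn't erase an algorithmic edge:

```python
# Toy illustration of the "steeper slope / better offset" point: two
# hypothetical training recipes following a power-law scaling fit
# loss(C) = a * C**(-b) + floor, evaluated at two compute budgets.
def loss(compute, a, b, floor):
    return a * compute ** (-b) + floor

# Invented coefficients: recipe B has a steeper slope (larger b),
# i.e. it converts each extra unit of compute into more improvement.
recipe_a = dict(a=10.0, b=0.05, floor=1.0)
recipe_b = dict(a=10.0, b=0.10, floor=1.0)

for c in (1e21, 1e23):  # second budget is 100x the first
    la, lb = loss(c, **recipe_a), loss(c, **recipe_b)
    print(f"compute={c:.0e}  A={la:.3f}  B={lb:.3f}")

# Both recipes improve with 100x compute, but B stays ahead at every
# budget, so abundance of compute doesn't make the better method moot.
```

The numbers are purely illustrative; the shape of the argument is what matters: with a shared floor, the recipe with the steeper slope wins at every compute level.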
It might be something like literally compute clusters orbiting Earth.
solar panels.
The problem with that is heat dissipation. You get all the radiation from the sun, and you don't have any air to dissipate heat. But there is a lot of space to put clusters, there's a lot of solar energy there, and you could figure out the heat dissipation. There probably could be engineering will to solve the heat problem, so there could be.
Is it possible, and we should say that it definitely is possible, how likely it is is the question, that we're basically going to plateau this year? Not in terms of the system capabilities, but in what the system capabilities actually mean for human civilisation. So on the coding front, really nice websites will be built, very nice autocomplete, a very nice way to understand code
and maybe help debug, but really just a very nice helper on the coding front. It can help research mathematicians do some math. It can help you with shopping. It's a nice helper, it's Clippy on steroids. What else? It may be a good education tool and all that kind of stuff, but computer use turns out extremely difficult to solve. So I'm trying to frame the
cynical case in all these domains, where there's not a really huge economic impact, but we realise how costly it is to train these systems at every level, both the pre-training and the inference, how costly the inference is, the reasoning, all of that. Is that possible, and how likely is that, do you think?
If you look at the models, there are so many obvious things to improve, and it takes a long time to train these models and do this art, such that it'll take us, with the ideas we have, multiple years to actually saturate in terms of whatever benchmark or performance we are searching for. It might serve very narrow niches: the average chat G_P_T_ eight-hundred-million user might not get a lot of benefit out of this, but it is going to serve different populations by getting better at different things.
But I think what everybody's chasing now is a general system that's useful to everybody. So okay, if that's not achievable, that can plateau, right?
I think that dream is actually kind of dying, as you talked about with the specialized models. And multi-modal is often totally different; video generation is a totally different thing.
"That dream is kind of dying" is a big statement. 'Cause I don't know if it's dying. If you ask the actual frontier lab people, I mean, they're still chasing it, right?
I do think they are still rushing to get the next model out, which will be much better, not just in a relative sense but better than the previous one, and I can't see them slowing down. I just think the gains will be made, or felt, more through not only scaling the model. I feel like there's a lot of tech that is like, well, let's just put the better model in there, and a better model and a better model, and now people are saying, okay, let's also, at the same time, improve
everything around it, like the engineering of the context and inference scaling. The big labs will still keep doing that, and now the smaller labs will catch up too, because they are hiring more, there will be more people, and L_L_M_s, it's kind of like a circle, also make them more productive. I think what we can expect is amplification, but not a paradigm change, I don't think that is true; everything will just be amplified and amplified.
I could see that continuing for a long time, you know.
Yeah, I guess my statement that the dream is dying depends on exactly what you think it's gonna be doing. Claude Code is a general model that can do a lot of things, but it depends a lot on integrations and other things. I bet Claude Code could do a fairly good job of doing your email, and the hardest part is figuring out how to give the information to it and how to get it to be able to send your emails and stuff like this. But I think it goes back to the one-model-to-rule-them-all
ethos, which is a thing in the cloud that handles your entire digital life and is way smarter than everybody. It's operating in a
It's an interesting leap of faith to go from Claude Code to becoming that, which, in some ways,
there are some avenues for, but I do think the rhetoric of the industry is a little bit different.
I think the immediate thing we will feel next, as normal people using L_L_M_s, will probably be related to something also trivial, like making figures. Right now, L_L_M_s are terrible at making figures. Is it because we are getting served the cheap models, with less inference compute than behind the scenes? Maybe there are some cranks we can turn to already get better figures. But if you ask today, draw me a flow chart of X_Y_Z_, it's most of the time terrible. And it is kind of
a very simple task for a human. I think it's almost easier sometimes to draw something than to write something.
Yeah, multi-modal understanding does feel like something where it's odd that it's not better solved.
I think we're not saying one actually obvious thing, that we're not actually realising, that's a gigantic thing that's hard to measure, which is making all of human knowledge accessible to the entire world. One of the things that I think is hard to articulate is that there's just a huge difference between Google search and an L_L_M_. I feel like I can basically ask an L_L_M_ anything and get an answer. And it
is doing less and less hallucination. And that means understanding my own life, figuring out a career trajectory, figuring out how to solve the problems all around me, learning about anything through human history. I feel like nobody's really talking about that, because they just immediately take it for granted. This is awesome. That's why everybody's using it: 'cause you get answers for stuff.
The impact of that across time. Like, think about it: this is not just in the United States, it's all across the world, kids throughout the world being able to learn these ideas. The impact that has across time is probably, talk about G_D_P_, it won't be a leap, it'll be: that's how we get to Mars, that's how we build these things, that's how we have a million new open A_I_s, all the kind of innovation that happens from there. And that's
just this quiet force that permeates everything, right. Human knowledge.
I do agree with you, in a sense: it makes knowledge more accessible. But it also, I think, depends on what the topic is. For something like math, in a sense you can ask it questions and it answers, but if you wanna learn a topic from scratch, like we talked about earlier, I think the sweet spot is, I mean, there are really good math textbooks where someone laid it out linearly, and that is a proven strategy to learn this
topic. And it does make sense, if you start from zero, to ramp up with an information-dense text and soak it up. But then you use the L_L_M_ to make infinite exercises. You have problems in a certain area, or questions where something's uncertain, or you are uncertain about certain things. You ask it to generate example problems, you solve them, and you have questions, and then maybe you need more background knowledge and you ask it to generate that. But then the L_L_M_
won't give you anything, let's say, that is not in the textbook; it's just packaging it differently, if that makes sense. But then there are things where I feel it also adds value in a more, I mean, timely sense, where there is no good alternative besides a human doing it on the fly. For example, I dunno, let's say you're planning to go to Disneyland and you try to figure out which tickets to buy for which park when. Well, there is no textbook on that, there is no information-dense
resource, there is only the sparse internet. And then there is a lot of value in the L_L_M_. You just ask it, you have the constraints: I'm travelling these and these days, I want to go there and there, please figure out what I need, when, and from where, and what it costs, and stuff like that. And it is a very customised, on-the-fly package, and this is like one of a thousand examples. Personalisation is essentially pulling information from the sparse internet, the non-information-
dense thing, where there is no better version that exists, it just doesn't exist, you make it from scratch almost.
And if it does exist, it's full of, speaking of Disney World, what would you call it, ad slop? It's impossible to get through. Here, take any city in the world: what are the top ten things to do? An L_L_M_ is just way better to ask than anything on the internet.

That's 'cause they're massively subsidised, and they're gonna be paid for by ads. It's coming.
Maybe it comes up first, maybe not. I think there are clear laws around this, you have to be clear about it. But I think that's what everyone fears, the subtle message in there or something like that. It also brings us to the topic of ads, which I think Open A_I_ may try to launch in twenty twenty five, just because it's still not making money in that other way right now, so, having real ad spots in there. The thing, though, is they couldn't, because there are alternatives without ads and people would just flock to the other products. And it's also just crazy how they're one-upping each other, spending so much money just to get the users.
So, take Instagram ads. I don't use Instagram, but I understand the appeal of paying a platform to find users who will genuinely like your product, and that is the best case of things like Instagram ads. But there are also plenty of cases where advertising is very awful for incentives. I think a world where the power of A_I_ can integrate with that positive view, like, I am a person with a small business, I want to make the best, I dunno, damn steak knives in the world, and I want to sell them to somebody who needs them, if A_I_ can make that sort of advertising work even better, that's very good for the world, especially for digital infrastructure, because that's how the modern web has been built. But that's not to say addicting feeds, so that you can show people more content, are a good thing. I think that's even what Open A_I_ would say: they want to find a way to get the monetization upside of ads while still giving their users agency. And I personally would think that Google is probably gonna be better at figuring out how to do this, 'cause they already have ad supply. If they figure out how to turn the demand in their Gemini app into useful ads, they can just turn it on. Somebody will figure it out. I don't know if it's this year, but there will be experiments with it.
I do think what holds companies back right now is really just that the competition is not doing it. It's more of a reputation thing. I think people are just afraid right now of ruining their reputation and losing users, because it would make headlines if someone launched these ads.
Mm-hmm.
Something like that, where it will say, like, promoted, something small, and then there will be an image or something. I think right now the problem is who makes the first move.
If we go ten years out, the proposition for ads is that you will make so much money on ads, by having so many users, that you can use it to fund better R_ and D_ and make better models. That's why YouTube is dominating the market, like, Netflix is scared of YouTube. They have the ads, and I pay twenty eight dollars a month for Premium, so they make at least twenty eight dollars a month off of me and of many other people, and they're creating such a dominant position in video. So I think that's the proposition: ads can give you a sustained advantage in what you're spending per user. But there's so much money flowing right now that starting that flywheel is scary, 'cause it's a long-term bet.
Do you think there'll be some crazy big moves this year, business-wise? Like Google or Apple acquiring Anthropic or something like that?
Dario will never sell, but we are starting to see some types of consolidation, with Groq for twenty billion dollars and Scale A_I_ for almost thirty billion, and countless other deals like this. They're structured in a way that is actually detrimental to the Silicon Valley ecosystem, this sort of licensing deal where not everybody gets brought along, rather than a full acquisition that benefits the rank-and-file employee by getting their stock vested. That's a big issue for Silicon Valley culture to address, because the start-up ecosystem is the lifeblood: if you join a start-up, even if it's not that successful, your start-up very well might get acquired at a cheap premium and you'll get paid out for your equity. These licensing deals essentially take just the top talent a lot of the time. I think the Groq deal with NVIDIA is rumored to be better for the employees, but it's still this antitrust-avoiding thing, and I think this trend of consolidation will continue. Me and many smart people I respect have been expecting consolidation to happen sooner, but it seems like some of these things are starting to turn. At the same time, you have companies raising ridiculous amounts of money for reasons I don't like, where I'm like, I don't know why you're taking that money. So it's maybe mixed this year, but some consolidation pressure is starting.
What kind of surprising consolidation do you think we'll see? So you're saying Anthropic is a never. I mean, Groq is a big one, Groq with a Q, by the way.
Yeah. There's just a lot of start-ups, and there's a very high premium on A_I_ start-ups. So there could be a lot of ten-billion-range acquisitions, which is a really big acquisition for a start-up that was maybe founded a year ago. I think Manus A_I_, this company that's based in Singapore, was founded eight months ago and then had a two-billion-dollar exit. And I think there'll be some other big, many-billion-dollar acquisitions. Like Perplexity, yeah, people rumoured them to Apple. I think there's a lot of pressure and liquidity in A_I_. There's pressure on big companies to have outcomes, and I would guess that a big acquisition gives people leeway to then tell the next chapter of that story.
I mean, yeah, I guess Cursor. We've been talking about code, and somebody acquires Cursor.
They're in such a good position by having so much user data, and we talked about continual learning and stuff. They had one of the most interesting two sentences in a blog post, which is that their new Composer model was a fine-tune of one of these large mixture-of-experts models from China. You can know that through gossip, or because the model sometimes responds in Chinese, which none of the American models do. And they had a blog post where they said they're updating the model weights every ninety minutes based on real-world feedback from people using it, which is the closest thing to real-world R_L_ happening on a model. And it's just in one of their blog posts, which is super cool.
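As a rough mental model of that deployment-feedback loop, here is a minimal sketch. Everything in it is a hypothetical stand-in: Cursor has not published its training setup, so the "model" is a tiny linear scorer, the reward signal is simulated, and each loop iteration plays the role of one ninety-minute update window.

```python
import numpy as np

# Toy stand-in for "update the model weights every ninety minutes
# based on real-world feedback". The real system fine-tunes a large
# mixture-of-experts transformer; here the "model" is a 4-weight
# linear scorer over feature vectors.
rng = np.random.default_rng(0)
weights = np.zeros(4)

def collect_feedback(batch_size=64):
    """Simulate deployment traffic: feature vectors for served
    completions plus a binary reward (e.g. did the user keep the edit)."""
    feats = rng.normal(size=(batch_size, 4))
    rewards = (feats[:, 0] > 0).astype(float)  # users "like" feature 0
    return feats, rewards

def update(w, feats, rewards, lr=0.1):
    """Reward-weighted update (a crude REINFORCE-style step):
    nudge the scorer toward completions that users accepted."""
    baseline = rewards.mean()  # simple baseline for variance reduction
    grad = ((rewards - baseline)[:, None] * feats).mean(axis=0)
    return w + lr * grad

# Each iteration = one "ninety-minute window": collect live feedback,
# push a new checkpoint.
for _ in range(200):
    feats, rewards = collect_feedback()
    weights = update(weights, feats, rewards)

print(weights[0] > abs(weights[1]))  # → True: the rewarded direction dominates
```

The design point the sketch illustrates is the cadence, not the algorithm: the training signal comes from live usage rather than a static dataset, so the weights drift toward whatever users actually reward.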
And by the way, I use Composer a lot, because one of the benefits it has is it's fast.
I need to try it 'cause everybody says this.
And there'll be some I_P_O_s, potentially. You think Anthropic, Open A_I_, X_A_I_?
They can all raise so much money so easily that they don't feel a need to. So long as fundraising is easy, they're not gonna I_P_O_, because public markets apply pressure. We're seeing in China that the ecosystem's a little different, with both MiniMax and Z.ai filing I_P_O_ paperwork, and it'll be interesting to see how the Chinese market reacts. I actually would guess that it's gonna be similarly hypey to the U_S_, so long as all this keeps going, and not based on the reality that they're both losing a ton of money. I wish more of America's gigantic A_I_ start-ups were public, because it would be very interesting to see how they're spending their money, to have more insight, and also just to give people access to investing in them, because I think they're some of the formative companies of the era. And the tradition now is for so many of the big start-ups in the U_S_ to not go public. We're still waiting for the Stripe I_P_O_, and Databricks definitely didn't, they raised like a Series G_ or something. It's kind of a weird equilibrium for the market. I would like to see these companies go public and evolve in the way that a public company can.
You think ten years from now some of the frontier model companies are still around? Anthropic, Open A_I_?
I definitely don't see it being winner-takes-all, unless there truly is some algorithmic secret that one of them finds, some flywheel, because the development path is so similar for all of them. Google and Open A_I_ have all the same products, and Anthropic's more focused, but when you talk to people it sounds like they're solving a lot of the same problems. And the offerings will spread out. It's a very big cake that's being made, and a lot of people are gonna take money out of it.
I don't wanna trivialise it, but Open A_I_ and Anthropic are primarily L_L_M_ service providers, while some of the other companies, like Google and X_A_I_ linked to X_, do other stuff too. So it's very possible, if A_I_ becomes more commodified, that the companies that are just providing L_L_M_s will die.
I think they will survive. The advantage they have is a lot of users, and I think they will just pivot. Anthropic, I think, pivoted: I don't think they originally planned to work on code, but it happened that they found, okay, this is a nice niche, and now we are comfortable in this niche and we push on it. I can see the same thing happening again. Let's say, hypothetically speaking, I'm not sure it will be true, but let's say Google takes all the market share of the general chatbot. Maybe Open A_I_ will then focus on some other sub-topic. They have too many users to go away in the foreseeable future, I think.
I think Google is always ready to say, whoa, search might be over, move everyone to A_I_ Mode.
I think the question is whether the companies can support the valuations. I could see the A_I_ companies being looked at in some ways like A_W_S_, Azure, and G_C_P_, which are all competing in the same space and are all very successful businesses. There's a chance the A_P_I_ market is so unprofitable that they go up and down the stack, into products and hardware. They have so much cash that they can build power plants and build data centres, which is a durable advantage now. But there's also a reasonable outcome where these A_P_I_s are so cheap and so flexible for developers that they become something like A_W_S_. But A_W_S_ and Azure are also gonna have these A_P_I_s. Five or six players competing in the A_P_I_ market is hard, so maybe that's how they get squeezed out.
You mentioned R_I_P_ Llama. Is there a path to winning for Meta?
I think nobody knows, they're moving a lot. They're signing licensing deals with Black Forest Labs, the image generation company, or Midjourney, or Manus. So on the product and consumer-facing A_I_ front, it's too early to tell. I think they have some people that are excellent and very motivated being close to Zuckerberg, so there's still a story to unfold there. Llama is a bit different. Llama was the most focused expression of the organisation, and I don't see Llama being supported to that extent. It was a very successful brand for them, so they still might do some participation in the open ecosystem, or continue the Llama brand on a different surface. Do people know what Llama is?
Do you think there's a Llama five?
Not an open-weight one.
It's interesting. Just to recap a bit: Llama was, I would say, the pioneering open-weight model, and Llama one, two, three got a lot of love. Then, just hypothesizing or speculating, I think the leaders at Meta, the upper executives, got really excited about Llama because they saw how popular it was in the community. And I think the problem was trying to, let's say, not monetize the open source, but use open source to make a bigger splash, to force it almost. It felt forced, developing these very big Llama four models to be on top of the benchmarks. But I don't think the goal of the Llama models was to be on top of the benchmarks, beating, let's say, ChatG_P_T_ or other models. I think the goal was to have a model that people can use, trust, modify, and understand, and that includes having smaller models. They don't have to be the best models. What happened instead was that the benchmarks suggested the models were better than they were, because I think they had specific models trained on preferences so that they performed well on the benchmarks. It's kind of an overfitting thing, forcing it to be the best. And at the same time, they didn't do the small models that people could use; no one could run these big models. So it was a weird thing, and I think it's just because people got too excited about headlines. Pushing the frontier, I think, is a good thing.
Yeah, I think it imploded under internal political fighting and misaligned incentives. The researchers want to build the best models, but there's a layer of organisation and management trying to demonstrate that they do these things. And there are lots of pieces and rumours about how some horrible technical decision was made and how that came in. It just seems like it got so bad that it all crashed out. But we should
Mm-hmm.
also give huge props to Mark Zuckerberg. I think it comes from Mark actually, from the top of the leadership, saying open source is important. The fact that that exists means there could be a Llama five, where they learn the lessons from the benchmaxing and say, we're gonna be like G_P_T_O_S_S_ and provide a really awesome library of open source.
What people say is that there's a debate between Mark and Alexandr Wang, who is very bright, but much more against open source, and to the extent that he has a lot of influence over the A_I_ org, it seems much less likely. It seems like Mark brought him in as fresh leadership to aid in directing A_I_, and if open or closed is no longer the defining nature of the model, I don't expect that to be a defining argument between Mark and Alex. They're both very bright. But I have a hard time understanding all of it, because Mark wrote this piece in, maybe, July of twenty twenty four, which was probably the best blog post at the time, making the case for open source A_I_, and then July twenty twenty five came around and it was, we're re-evaluating our relationship with open source. So it's just kinda like that.

I think also, not the problem, but we may have been a bit too harsh, and that caused some of it. We as open source developers, the open source community: even though the model was maybe not what everyone hoped for, it got a lot of backlash, and I think that was a bit unfortunate. I can see that as a company they were hoping for positive headlines, and instead of getting no headlines, or those positive headlines, they got negative headlines, and it reflected badly on the company. So maybe it's a spite reaction: we tried to do something nice, we tried to give you something cool, an open source model, and now you're being negative about us, so maybe we'll change our mind. I guess, I dunno.
Yeah, that's where the dynamics of discourse on X_ can lead us as a community astray, because sometimes it feels random which thing people decide they don't like. Maybe we'll see the same thing with Grok four point one and Grok Code Fast one. I don't think, vibe-wise, people love them publicly, but a lot of people use them. If you look at Reddit and X_, the programming community doesn't really give it praise, but they use it. And probably the same thing with Llama. I don't understand the dynamics of either positive hype or negative hype.
I mean, one of the stories of twenty twenty five is the U_S_ feeling the gap left by Llama, with the rise of these Chinese open-weight models, to the point where that was the single issue I spent a lot of energy on in the last five months, trying to do policy work to get the U_S_ to invest in this.
So tell me the story of ATOM.
The ATOM Project started as me calling it the American DeepSeek Project, which doesn't really work for D_C_ audiences. But it's the story of what is the most impactful thing I could do with my career: these Chinese open-weight models are cultivating a lot of power, and there is a lot of demand for building on open models, especially in enterprises in the U_S_ that are very cagey about the Chinese models.
According to Perplexity: the ATOM Project, American Truly Open Models, is a U_S_-based initiative to build and host high-quality, genuinely open-weight A_I_ models and supporting infrastructure, explicitly aimed at competing with and catching up to China's rapidly advancing open-source A_I_ ecosystem.
I think the one-sentence summary would be that, or two sentences. One is the proposition that open models are going to be an engine for A_I_ research, because that is what people start with, and therefore it's important to own them. The second is that the U_S_ should therefore be building the best open models, so that the best research happens in the U_S_, and U_S_ companies capture the value of being the home of where A_I_ research is happening. Without more investment in open models, we have all the plots on the website where it's Qwen, Qwen, Qwen, Qwen, all these excellent models from Chinese companies that are cultivating influence in the U_S_, in China, and internationally. And the U_S_ is spending way more on A_I_. The ability to create open models that are half a generation or a generation behind the cutting edge of a closed lab costs on the order of a hundred million dollars, which is a lot of money, but not a lot of money to these companies. So we need a centralising force for the people who want to do this, and I think we got signs of engagement from people pretty much across the full stack, including policy.
So there has been support from the administration?
I don't think anyone technically in government has signed it publicly, but I know that people who have worked on A_I_ policy, in both the Biden and Trump administrations, are very supportive of trying to promote open-source models in the U_S_. For example, Ai2 got a grant from the N_S_F_ for a hundred million dollars over four years, which is the biggest C_S_ grant the N_S_F_ has ever awarded, and it's for Ai2 to attempt this. I think it's a starting point. But the best thing happens when there are multiple organisations building models, because they can cross-pollinate ideas and build an ecosystem. It doesn't work if it's just Llama releasing models to the world, because then Llama can go away. The same thing applies to Ai2: I can't be the only one building models. So it becomes a lot of time spent talking to people, whether they're in policy or elsewhere. I know NVIDIA is very excited about this; I think Jensen Huang has been specifically talking about the urgency of this. And they've done a lot more in twenty twenty five: the Nemotron models are more of a focus, they've started releasing some data along with NVIDIA's open models, and very few companies do this, especially of NVIDIA's size. So there are signs of progress. And we hear about Reflection A_I_, where they say their two-billion-dollar fundraise is dedicated to building U_S_ open models, and their announcement tweet reads like a blog post. So I think that cultural tide is starting to turn. In July we had something like four or five DeepSeek-caliber Chinese open-weight models and zero from the U_S_. That's the moment where I was like, oh, I guess I have to spend energy on this, because nobody else is gonna do it. It takes a lot of people contributing together. I don't say that the ATOM Project is the thing that's moving the ecosystem, but it's people like me doing this sort of thing to get the word out.
Do you like the twenty twenty five America's A_I_ Action Plan? That includes open source stuff. The White House A_I_ Action Plan includes a dedicated section titled Encourage Open-Source and Open-Weight A_I_, defining such models and arguing they have unique value for innovation and startups.
Yeah. The A_I_ Action Plan is a plan, but it's maybe the most coherent policy document that has come out of the administration, and I hope it largely succeeds. I know people who have worked on the A_I_ Action Plan, and the challenge is taking policy and making it real, and I have no idea how to do that as an A_I_ researcher. But a lot of the things in it were very real. There's a huge build-out of A_I_ in the country, and there are a lot of issues people are hearing about, from water use to whatever. We should be able to build things in this country, but we also need to not ruin places in our country in the process of building, and that's worthwhile to spend energy on. That's a role the federal government plays: they set the agenda. With A_I_, setting the agenda that open weights should be a first consideration is a large part of what they can do, and then people think about it.
Also for education, and for talent for these companies, I think it's very important, because otherwise, if there are only closed models, how do you get the next generation of people contributing? At some point you'd only be able to learn after you join a company, but at that point, how do you hire talented people, how do you identify talented people? Open source matters for a lot of things, but even just for educating the population and training the next generation of researchers, it's the way, or the only way.

The way I could have gotten this to go more viral was to tell a story of Chinese A_I_ integrating with an authoritarian state and becoming A_S_I_ and taking over the world, and therefore we need our own American models. But it's very intentional that I talk about innovation and science in the U_S_, because I think it's both more realistic as an outcome and a world I would like to manifest.

I would say, though, that any open-weight model, I do think, is a valuable model.
Yeah. And my argument is that we should be in a leading position. But it's worth saying it so simply, because there are still voices in the A_I_ ecosystem that say we should consider banning the release of open models due to the safety risks. And it's worth adding that I think that's effectively impossible without the U_S_ having its own great firewall, which is also known to not work that well, because the cost of training these models, whether it's one to a hundred million dollars, is attainable to a huge number of people in the world who want to have influence. So these models will be getting trained all over the world. There are safety concerns, but we want this information and these tools to flow freely across the world and into the U_S_, so that people can use them and learn from them. Stopping that would be such a restructuring of our internet that it seems impossible.
Do you think, in that case, the big open-weight models from China are actually a good thing, in a sense, for the U_S_ companies? Because, as you mentioned earlier, the U_S_ companies are usually one generation behind in terms of what they release open source versus what they are using. For example, G_P_T_O_S_S_ might not be the cutting-edge model, Gemma three might not be, but they do that because they know it's safe to release. Then when these companies see, for example, DeepSeek version three point two, which is really awesome, get released, and there is no backlash, there is no security risk, that could encourage them to release better models. Maybe that, in a sense, is a very positive thing.
A hundred percent. These Chinese companies have set things into motion that I think would potentially not have happened if they were not all releasing models. I'm almost sure those discussions have been had by leadership.
Is there a possible future where the dominant A_I_ models in the world are all open source?
It depends on the trajectory of progress you predict. If you think saturation in progress is coming within a few years, essentially within the time where financial support is still very good, then open models will be so optimised and so much cheaper to run that they will win out. This goes back to open source ideas: so many more people will be putting money into optimising the serving of these open-weight, common architectures that they will become standards, and then you could have chips dedicated to them, and it'll be way cheaper than the custom offerings from the closed companies.
We should say that the A_I_ twenty twenty seven report kind of predicts, one of the things it does from a narrative perspective, is that there'll be a lot of centralization. As the A_I_ systems get smarter and smarter, national security concerns come to the fore, the labs get centralized and become super secretive, and there's this whole race from a military perspective between China and the United States. And so, in the middle of all these fun conversations we're having about L_L_M_s, the soldiers will come into the room and be like, alright, we're now in the Manhattan Project stage of this whole thing.
In twenty twenty five, twenty six, twenty seven, I don't think something like that is even remotely possible. I mean, you can make the same argument for computers, right? You can say, okay, computers are capable and we don't want the general public to get them, or chips, even A_I_ chips. But you see how Huawei makes chips now. It took a few years, but I don't think there is a way you can contain knowledge like that. In this day and age it's impossible, like the internet. I don't think this is a possibility.
On the Manhattan Project thing, one of my funny takes on it is that a Manhattan Project-like thing for open models would actually be pretty reasonable, because it wouldn't cost that much. I think that will come, and it seems like culturally the companies are changing. But I agree with Sebastian on all the stuff you just said. I don't see it happening, nor being helpful.
Yeah, I mean, the motivating force behind the Manhattan Project was a civilizational risk. It's harder to motivate that for open-source models.
There's no civilizational risk.
On the hardware side, we mentioned NVIDIA a bunch of times. Do you think Jensen and NVIDIA are gonna keep winning?
I think they have the downside that they have to iterate a lot and manufacture a lot, and they do innovate, but there's always the chance that somebody does something fundamentally different, gets very lucky, and pulls it off. The problem, though, is adoption. The moat of NVIDIA is probably not just the G_P_U_, it's the CUDA ecosystem, and that has evolved over, I think, two decades. Even back when I was a grad student, in a lab doing biophysical simulations, molecular dynamics, we had a Tesla G_P_U_ just for the computation, and that was, what, fifteen years ago now. They built this up over a long time, and that's the moat, I think, not the chip itself, although they now have the money to iterate and build and scale. It's really the compatibility: if you're at that scale as a company, why would you go with something risky where only a few chips can be made per year? You go with the big one. But I do think, with L_L_M_s now, it will be easier to design something like CUDA. It took fifteen years because it's hard, but now we have L_L_M_s, and maybe we can replicate CUDA.
What about the separation of training and inference compute, as things stabilise and more and more compute is needed for inference?
That's supposed to be the point of the Groq acquisition. And that's part of what Vera Rubin is, where they have a new chip with no high-bandwidth memory, or very little, which is one of the most expensive pieces. It's designed for prefill, which is the part of inference where you essentially do a lot of matrix multiplications, and then you only need the memory when you're doing the autoregressive generation and you have the K_V_ cache swaps. So they have this new G_P_U_ that's designed for that specific use case, and then the cost of ownership
per flop or whatever is actually way lower. But I think that Nvidia's fate still lies in the diffusion of A_I_. Their biggest clients are still these hyperscale companies: Google obviously can make T_P_U_s, Amazon is making Trainium, Microsoft will try to do its own things. And
so long as the pace of A_I_ progress is high, Nvidia's platform is the most flexible and people will want that. But if there's stagnation, then there's more time to create bespoke chips.
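The prefill/decode split mentioned above can be made concrete with a toy NumPy sketch (sizes and names are illustrative, not any vendor's actual design): prefill runs attention over the whole prompt in one compute-bound matmul, while each decode step appends a single row to the K_V_ cache and re-reads the whole cache, which is why decode is memory-bandwidth-bound and prefill is not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8           # head dimension (toy size)
prompt_len = 5  # tokens processed in the prefill phase

def attention(q, K, V):
    # Scaled dot-product attention: queries q against cached keys/values K, V.
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# Prefill: the whole prompt is processed in one big matrix multiplication,
# so arithmetic dominates and high-bandwidth memory matters less.
X = rng.normal(size=(prompt_len, d))
K_cache, V_cache = X.copy(), X.copy()  # stand-ins for projected keys/values
prefill_out = attention(X, K_cache, V_cache)

# Decode: each generated token appends one K/V row and reads the entire
# cache back, so the step is dominated by memory traffic, not arithmetic.
x_new = rng.normal(size=(1, d))
K_cache = np.vstack([K_cache, x_new])
V_cache = np.vstack([V_cache, x_new])
decode_out = attention(x_new, K_cache, V_cache)

print(prefill_out.shape, decode_out.shape)  # (5, 8) (1, 8)
```

In a real serving stack the cache holds separately projected keys and values per layer and per head; the sketch collapses all of that to show only the shape of the two phases.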
It's interesting that Nvidia is quite active in trying to develop all kinds of different products.
They tried to create areas of commercial value that will use a lot of G_P_U_s.
Mm-hmm. But they keep innovating, and they're doing a lot of incredible research, so.
Everyone says that the company is super oriented around Jensen and how
operationally plugged in he is, and it sounds so unlike many other big companies that I've heard about. So long as that's the culture, I expect them to keep progress happening. He's still in the Steve Jobs era of Apple. So long as that is how it operates, I'm pretty optimistic for their situation, because it is their top-order problem. And I don't know if making these chips for the whole ecosystem is the top goal of all these other companies. They'll do a good job, but it
might not be as good a job.
Since you mentioned Jensen, I've been reading a lot about history and about singular figures in history. What do you guys think about the single man or woman view of history? How important are individuals for steering the direction of history in the tech sector? What's Nvidia without Jensen? You mentioned Steve Jobs, so what's Apple without Steve Jobs? What's X_A_I_ without Elon?
Or Deep Mind without Demis.
People make things happen earlier and faster. Scientifically, many great scientists credit being in the right place at the right time and still making the innovation, where eventually someone else would have had the idea. So I think that in that way Jensen is helping manifest this G_P_U_ revolution much faster, and with much more focus, than it would have happened without a person there. And this is making the whole A_I_ build-out faster. But I do still
think that eventually something like chat G_P_T_ would have happened, and a build-out like this would have happened, but it probably would not have been as fast. I think that's the flavour I'd apply.
These individuals are people who are placing bets on something. Some get lucky, some don't. But if you don't have these people at the helm, it will be more diffuse. It's almost like investing in an E_T_F_ versus individual stocks. Individual stocks might go up or down more heavily than an E_T_F_, which is more balanced and will eventually go up over time, and we'll get there. But it's focus, I think. Passionate focus.
Isn't there a real case to be made that without Jensen there's not a reinvigoration of the deep learning revolution?
It could have been twenty years later, is the thing that I would say. Or another A_I_ winter, like a deep learning winter, could have come if G_P_U_s weren't around.
That would change history completely, 'cause you could think of all the other technologies that could have come in the meantime, and the focus of human civilisation, of Silicon Valley, could have been captured by a different hype.
But I do think, I mean, there's certainly an aspect where the G_P_U_ trajectory was planned, but on the other hand there were also a lot of lucky coincidences. For example, the investment into, let's say, biophysical simulations. I mean, I think it started with video games, and the chips just happened to be good at linear algebra, because video games require a lot of linear algebra, and then you have the biophysical simulations. But still, I don't think the master plan was A_I_. It just
happened to be Alex Krizhevsky. Someone took these G_P_U_s and said, hey, let's try to train a neural network on that, and it happened to work really well. And I think it only happened because you could purchase those G_P_U_s.
Mm-hmm.
That's what I would think. I think the G_P_U_s would have been different for AlexNet, but G_P_U_s would still exist at the time of AlexNet and at the time of the transformer. It was just hard to know whether it would be one company this successful or multiple smaller companies with worse chips. But I don't think that's a hundred-year delay. It might be a decade delay.
I mean, I just can't see Intel or AMD doing what NVIDIA did with CUDA.
Mm-hmm. Like Silicon Graphics or something.
But just looking at it, it seems like these singular figures, these leaders, have a huge impact on the trajectory of the world. Obviously there are incredible teams behind them, but having that kind of very singular, almost dogmatic focus is necessary to make progress.
Yeah, I mean, even G_P_T_ wouldn't exist if there wasn't a person, Ilya, who pushed for this scaling, right? I mean, Dario was also deeply involved in that. It almost seems wild thinking about how early these people were, saying we need to hook up ten thousand G_P_U_s, take all of Open A_I_'s compute, and train one model. There were a lot of people there that didn't wanna do that.
Again, singular figures. Speaking of which: a hundred years from now, this is presumably post-singularity, whatever the singularity is, when historians look back at our time, what technological breakthroughs will they really emphasise as the breakthroughs that led to the singularity? So far we have Turing to today, eighty years.
I think it would still be computing, the umbrella term computing. I don't necessarily think that even a hundred or two hundred years from now it would be A_I_; it could still well be computers, you know. We are now just taking better advantage of computers, but the fact of computing.
It's basically a Moore's-law kind of discussion. Even the details of CUDA and G_P_U_s won't be remembered, and it won't be all the software turmoil. It'll just be, obviously, compute.
I would generally agree, but can the connectivity of the internet and compute be merged, or is it both of them?
I think the internet will probably be grouped under, yeah, communication. It could be the phone, the internet, satellites, that stuff. And compute is more the scaling aspect of it.
It's possible that the internet is completely forgotten, that the internet gets wrapped into the phone networks, the communication networks, as just another manifestation of those. And the real breakthrough comes from the increased compute, Moore's law broadly defined.
Well, I think that connection of people is very fundamental to it. You can talk to anyone; if you wanna find the best person in the world for something, they are somewhere in the world. And being able to have that flow of information, the A_I_s will also rely on this. I've been fixating on, when I said the dream was dead about the one central model, the thing that is evolving instead: people have many agents for different tasks. People are already starting to do this with different Claudes for different tasks, and it's described as
many A_G_I_s in the data center, where each one manages tasks and they talk to each other. That is so reliant on networking and the free flow of information on top of compute. But networking, especially with G_P_U_s, is such a part of scaling up compute. The G_P_U_s in the data centers need to talk to each other.
Do you think anything about neural networks will be remembered? Do you think there's something very specific and singular to the fact that it's neural networks? That seems like a stroke of genius, that you're basically replicating, in a very crude way, the human mind, the structure of the human brain.
I think without the human mind we probably wouldn't have neural networks, because it was the inspiration for them. But on the other end, I think they're just so different. I mean, it's digital versus biological, so I do think it will probably be grouped more as an algorithm.
That's massively parallelizable on this particular kind of compute.
It could well have been genetic computing, genetic algorithms, just as parallelised. I think it just happens that this is more efficient, works better, you know.
And it very well could be that the L_L_M_, you know, the neural networks the way we architect them now, is just a small component of the system that leads to the singularity.
If you think a hundred years out, I think society can be changed more with more compute and intelligence because of autonomy. But looking at the Industrial Revolution, what are the things we remember? The engine is probably the equivalent of the computer here, but there are a lot of other physical transformations that people are aware of, like the cotton gin, all these machines that are still known: air
conditioners, refrigerators. Some of these things from A_I_ will still be known. The word transformer could still very well be known. I would guess that deep learning is definitely still known, but the transformer might be evolved away from in a hundred years, with A_S_I_-level A_I_ researchers everywhere. But I think deep learning is likely to be a term that is remembered.
And I wonder what the air conditioning and the refrigeration of the future is that A_I_ brings. If we travel forward a hundred years, transport there right now, what do you think is different? How does the world look different? First of all, do you think there are humans? Do you think there are robots everywhere walking around?
I do think specialised robots for sure, for certain tasks. Maybe half of them humanoid, we'll see. I think for certain things, yes, there will be humanoid robots, because the environment is built for humans; for certain tasks it might make sense. What's harder to imagine is how we interact with devices, and what humans do with devices. I'm pretty sure it will probably not be the cell phone, probably not the laptop. Will it be, you know, implants?
I mean, it has to be a brain-computer interface, right? A hundred years from now, given the progress we're seeing now, there has to be, unless there's legitimately a
complete alteration of how we interact with reality.
On the other hand, if you think of cars, cars are older than a hundred years, right? And it's still the same interface. We haven't replaced cars with something else; we just made the cars better. But it's still a steering wheel, it's still wheels, you know.
I think we'll still carry around a physical brick of compute, because people want some ability to have something private. You might not engage with it as much as a phone, but having something where you can keep private information that is yours, as an interface between you and the rest of the internet, I think is something people will still want. It might not look like an iPhone, and it might be used a lot less, but I still expect people to carry things around.
Private for you, like encrypted messages, encrypted photos, you know, what your life is. I guess this is a question of how optimistic you are on brain-machine interfaces. Is all of that just gonna be stored in the cloud, your whole calendar? It's hard to think about brain-machine interfaces processing all the information that we can process visually, presenting something like a calendar to you. It's hard to even imagine
knowing your email inbox without looking, you signal to a computer and then you just know your inbox. Is that something the human brain can handle being piped into it non-visually? I don't know exactly how those transformations happen, 'cause humans aren't changing in a hundred years.
A local community, yeah,
like people you are close to, being able to do things with them, and being able to ascribe meaning to your life. If not in a hundred years, I don't think human biology is changing away from those on a time scale that we can discuss, and I think that U_B_I_ does not solve agency. I do expect mass wealth, and I hope that it is spread so that
average life does look very different in a hundred years. That's still a lot to happen in a hundred years, if you think about countries that are early in their development process of getting access to computing and the internet, building all the infrastructure, and having policy that shares one nation's wealth with another.
I think it's an optimistic view to see all of that happening in a hundred years while they are still independent entities, not just absorbed into some international order by force.
But there could be better, more elaborate, more effective social support systems that help alleviate some levels of basic suffering in the world. In the transformation of society where a lot of jobs are lost in the short term, I think we have to really remember that each individual job that's lost is a human being who's suffering. When jobs are lost at that scale, it's a real tragedy. You can make all kinds of arguments about
economics, that it's all going to be okay, it's good for the G_D_P_, there are going to be new jobs created, but fundamentally, at the individual level, for that human being, that's real suffering. That's a real personal tragedy, and we have to not forget that as these technologies are being developed.
And also, my hope, with all the A_I_ slop we're seeing, is that there'll be a greater and greater premium on the fundamental aspects of the human experience that are in person, the things that we all like: seeing each other, talking together in person.
The next few years are definitely gonna bring an increased value on physical goods and events, and even more pressure from slop. The slop is only starting. The next few years will bring more and more diverse versions of slop.
Mm-hmm.
on it.
Even, like, the classic examples. I honestly think this is true, and I think we'll get tired of it; we are already kind of tired of it. Same with art, I mean. I don't think art will go away. You have physical paintings; there's more value, not just monetary value, but more appreciation, for the actual painting than for a photocopy of that painting. It could be a perfect digital reprint, but there is something about going to a museum, looking at that art, seeing the real thing, and thinking, okay, a human made this.
It's a craft; you have an appreciation for that. And I think the same is true for writing, for talking, for any type of experience. Unfortunately, I do think it will be a dichotomy, like a fork, where some things will be automated. You know, there are not as many paintings as there used to be two hundred years ago; there are more photographs, more photocopies. But at the same time, it won't go away. There will be value in it. I think the difference will just
be the proportion. But personally, I have a hard time reading things where I can obviously see it's A_I_ generated. I'm like, I'm sorry, there might be really good information there, but I have a certain... nah, not for me, I think.
Eventually they'll fool you. And it'll be on platforms that give ways of verifying or building trust. So you will trust that Lex is not A_I_ generated, having been here, and then you have trust in this channel. But it's harder for new people that don't have that trust.
Mm-hmm.
This is real, this is not real. There will be some telltale signs where you can obviously tell this is A_I_ generated and this is not. But some will be so good that it's hard to tell, and then you have to trust, and that will get interesting and a bit problematic.
Mm-hmm. Mm-hmm.
Like human editing, which is the opposite of the discussion about trying to watermark A_I_ images; and then you can make a Google image that has a watermark and use a different Google tool to remove the watermark. Yeah, it's gonna be an arms race. Yeah.
I mean, all the capabilities that we've been talking about can also be used to destabilise human civilisation, even with relatively dumb A_I_ applied at scale, and then further with more and more superintelligent A_I_ systems. Of course there's the doomer take that's important to consider a little bit as we develop these technologies. What gives you hope about the future of human civilisation, given everything we've been talking about?
Are we going to be okay?
I think we will. I'm definitely a worrier, both about A_I_ and non-A_I_ things, but humans do tend to find a way. That's what humans are built for: to have community, find a way, and figure out problems, and that's what has gotten us to this point. And I think the A_I_ opportunity and related technologies are really big, and there are big social and political problems in
getting everybody to understand that. I think that's what we're staring at a lot right now: the world is a scary place and A_I_ is a very uncertain thing. And it takes a lot of work that is not necessarily building things; it's telling people and understanding people, which the people building A_I_ are historically not motivated to do or wanting to do. But it is probably doable, and it will just take longer than people want. And we have to go through that long period of
hard A_I_ discussions if we want to have the lasting benefits.
Yeah, and through that process I'm especially excited that we get a chance to better understand ourselves, both at the individual level as humans and at the civilisation level,
and answer some of the big mysteries, like: what is this whole consciousness thing going on here? It seems truly special; there's a real miracle in our mind, and A_I_ puts a mirror to ourselves, so we get to answer some of the big questions about what this whole thing going on here really is.
Well, one thing about that: what I do think makes us very different from A_I_, and why I don't worry about A_I_ taking over, is, like you said, consciousness. We humans decide what we want to do. With A_I_, in its current implementation, and I can't see that changing, you have to tell it what to do. So you still have the agency; it doesn't take the agency from you. It becomes a tool; you can think of it as a tool that you tell what to do. It will be more powerful
than other previous tools, certainly more powerful than a hammer, it can figure things out, but it's still you in charge, right? The A_I_ is not in charge, you are in charge; you tell the A_I_ what to do, and it does it for you.
So in the post-singularity, post-apocalyptic war between humans and machines, you're saying humans are worth fighting for.
A hundred percent. I mean, this is essentially the movie Terminator they made in the eighties, and the only thing I can see going wrong is, of course, if things are explicitly programmed to do the thing that is harmful, basically.
I think actually in that Terminator type of setup, humans win.
Mm-hmm.
I think we're too clever.
It's hard to explain how we figure it out, but we do, and we'll probably be using local L_L_M_s, open source L_L_M_s, to help fight the machines. I apologize for the ridiculousness. Like I said, Nathan already knows I've been a big fan of his for a long time, and I've been a big fan of yours, Sebastian, for a long time, so it's an honour to finally meet you. Thank you for everything you put out into the world, thank you for the excellent books you're writing, thank you for teaching us,
and thank you for talking today. This was fun.
Thank you for inviting us here and having this human connection, which is an extremely valuable one.
Thanks for listening to this conversation with Sebastian Raschka and Nathan Lambert. To support this podcast, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on. And now, let me leave you with some words from Albert Einstein.
It is not that I'm so smart, but that I stay with the questions much longer.
Thank you for listening and hope to see you next time.