Speaker 1
The following is a conversation all about the state-of-the-art in artificial intelligence, including some of the exciting technical breakthroughs and developments in A_I_ that happened over the past year, and some of the interesting things we think might happen this upcoming year.
At times it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great honor and pleasure to be able to do this kind of episode with two of my favourite people in the A_I_ community, Sebastian Raschka and Nathan Lambert.
They are both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers, and X posters. Sebastian is the author of two books I highly recommend for beginners and experts alike: Build a Large Language Model From Scratch and Build a Reasoning Model From Scratch.
I truly believe in the machine learning computer science world the best way to learn and understand something is to build it yourself from scratch.
Nathan is the post-training lead at the Allen Institute for A_I_ and author of the definitive book on reinforcement learning from human feedback. Both of them have great X_ accounts and great Substacks. Sebastian has courses on YouTube, Nathan has a podcast, and everyone should absolutely follow all of those.
And now a quick few-second mention of each sponsor. Check them out in the description or at lexfridman.com/sponsors. It is in fact the best way to support this podcast. We got a bunch of great sponsors: Box for intelligent content management, Quo for your phone system, like calls, texts, contacts for your business, Uplift Desk, the desk I'm sitting behind and my favourite office desk, Fin for customer service A_I_ agents, Shopify for selling stuff online, CodeRabbit for A_I_ powered code review, Element for electrolytes, and of course our long-time friend Perplexity, for curiosity-driven knowledge exploration. Choose wisely, my friends. And now on to the full ad reads. I try to make 'em interesting, but if you do skip, please still check out the sponsors. I enjoy their stuff. Maybe you will too. To get in touch with me, for whatever reason, go to lexfridman.com/contact. If you can't tell, I'm trying to have a bit of a pep in my step at the moment, because I had a long night and didn't get much sleep at all, so I am running on fumes: delirious, happy, unsure of what is reality and what is a dream. In fact, we could right now be living inside of a dream. I have been going through a lot. I have been working insane hours, so much going on, I am so overwhelmed. Of course, as always, I am truly grateful and happy to be alive, but I have not been able to publish as many episodes as I would like, so there's a bunch of sponsors we'll have to catch up on. Your support truly means the world. Please check out all the sponsors, and if you think it might be useful to you, buy their stuff. It really is the best way to support this podcast. Alright, let's go. First up, this episode was brought to you by Box, a cloud-
based platform for content management, file sharing, and all kinds of collaboration around content for your business. Like with a lot of companies, the big question is: how is A_I_ leveraged to make whatever the business does better? A lot of companies kinda use it for the hype and the label. It's kinda hilarious to watch people just say, like, powered by A_I_. I don't care if you're a bakery powered by A_I_. I don't know. But outside of all the hype, it is one of the most incredible things that humans have ever created. And so companies that can leverage it well are the companies that win. And of course, Box is legendary for its file and content management, especially when you're talking about scale. So obviously it's amenable to the utilization of A_I_ to help automate some of the documents, some of the workflows, some of the organisation, and they do that exceptionally well. They have a system called, as you could imagine, Box A_I_ that does just that. I love it: an excellent implementation on the interface side, on the back-end side, everything works extremely nicely. Help scale A_I_ across your organisation today and go to box.com/A_I_. That's box.com/A_I_ to learn more.
This episode is also brought to you by Quo, spelled Q_ U_ O_. It also happens to be a company name with just three letters that will help you win at Scrabble. Are you allowed to use company names in Scrabble? How many points is Q_? How many points is U_? I'm imagining a lot. That was one of the big confusions to me when I was first learning the English language. It always felt like Q_ should be at the end of the alphabet, maybe like Q_Z_. It always came as a surprise, to my limited brain capacity, that Q_ was earlier on in the alphabet. What is it, O_P_Q_? I can't even actually localize letters in the alphabet, and I'm sure that's the case for a lot of people, without reading the alphabet in my head sequentially. All of this has to do with short-term and long-term memory access, the functioning and limitations of human cognition, and maybe cognitive systems in general.
All of it is relevant to this particular episode, and not so relevant to the awesomeness of Quo, formerly known as OpenPhone, that I should be talking about. Of course, as is always the case, I think the point here, and the point everywhere, and the point of life, is to talk from the heart about whatever you want, and that's what I try to do with everything. And to generalize that even more: to talk whenever I want, and to shut the F_ up whenever I want and listen. And I prefer that more often than I prefer to talk. Insert clever transition here, because talk is somehow relevant. It is. So Quo, formerly known as OpenPhone, helps over ninety thousand businesses manage phone calls, texts, contacts, all kinds of phone-related stuff for business. You have a bunch of incoming customer calls, a bunch of people on the business side that have to answer those calls, have to manage them: what's the status of this particular request, voicemails, transcripts, all that kind of stuff. And obviously, a really nice, effective utilization of A_I_ to make that really efficient. But what's really important for things like this is that the interface is good, that team collaboration is good, and Quo delivers on that. Try Quo for free, plus get twenty percent off your first six months, when you go to quo.com/lex. That's Q_ U_ O_ dot com slash Lex. Tell your friends about it, because it just might help 'em win at Scrabble. Speaking of Scrabble, you usually wanna play Scrabble at a table. It's such a magical experience. I just had a vision from a distant past of me sitting with a friend and playing Scrabble at a table. What is this life? Full of beautiful memories. And then it's over too soon.
Yeah. That melancholy feeling is beautiful, I think. Insert another clever transition, à la Mark Normand maybe, because the name of this next company is Uplift Desk. As I said, it's my go-to favourite office desk, and it's also the desk that I use for podcast furniture. I already lost count: I have a lot of Uplift desks, standing desks, in my place everywhere. It's desks everywhere. I have a mattress on the floor and Uplift desks. I have a Linux box for robotics. I have a machine where I do a lot of the editing. All of that is on a desk. I have the three tables for the podcast desk, the very one you've seen over the past several years. That's all Uplift desks. I usually don't put them in standing mode, but they are desks that allow me to do all kinds of stuff: really easy to work with, really nice material, really sturdy. I just love everything about Uplift desks. When they said they wanna sponsor, after I've been using 'em for many years, I lost my mind. I love it when I've been in love with a company, in love with their product, for such a long time, and I get to also sing their praises. I mean, come on, what are you gonna tell me next, that FFmpeg wants to sponsor this podcast? Another sort of open source project, not a company, that I've been in love with. Anyway, go to upliftdesk.com/lex and use code Lex to get four free accessories, free same-day shipping, free returns, a fifteen-year warranty, and an extra discount off your entire order. That's U_P_L_I_F_T_D_E_S_K_ dot com slash Lex. Does spelling it out really help anybody? I don't know, but they really said pretty please; the one request was: spell it out. Again, what is this life? Incredible. This episode is also brought to you by Fin, the number one A_I_ agent for customer service. Find the niche and become number one. That's the idea here. For anybody building an A_I_ company, and we talk about this, is the dream of A_G_I_ dead? I think for a lot of companies, success is in the niche.
But there are a few, and Fin delivers on that niche. It's trusted by over six thousand customer service leaders at top companies, including A_I_ companies. When an A_I_ company trusts your company to do its customer service, that means you're legit. Ninety-day money-back guarantee, up to one million dollars, built to handle complex multi-step queries like returns, exchanges, and disputes. Go to fin.ai/lex to learn more about transforming your customer service and scaling your support team. That's fin.ai/lex. I don't know why I switched to this hyping voice. Crappy announcer, crappy radio jockey, crappy ad-read voice. It is what it is. Thank you for sticking with me this long. I feel the love and I send it right back at you. This episode
is also brought to you by a company whose engineers are also full of love: Shopify. It just brings a smile to my face. Every time I think about Shopify, I remember getting to see their engineering booth at NeurIPS, which is a machine learning conference. Really brilliant people, wonderful people. Of course the CEO, Toby, is still programming, still building stuff, still in on the details of the engineering, and now is talking quite a bit about the utilization of L_L_M_s for his own sort of pet projects, but also inside the company. It's just incredible when, from the very top, the company is in love with engineering. It's a celebration of great engineering. Just like the conversation with D_H_H_, the guy behind Ruby on Rails, which Shopify was built on; that conversation was a celebration of great engineering, and the beauty of engineering as well. Anyway, listen to that episode to see some of the magic of Ruby on Rails and the magic of Shopify and the magic of Toby that we talk about. Anyway, sign up for a one-dollar-per-month trial period at shopify.com/lex. That's all lowercase. Go to shopify.com/lex to take your business to the next level today.
This episode is also brought to you by CodeRabbit, a platform that provides A_I_ powered code reviews directly within your terminal.
We talk a lot in this episode about the timeline for the full automation of the human programmer. I think we're quite far away from taking the human out of the loop. The review process, the debugging process, all of that is such a crucial part of programming, especially when, just like we talk about in the episode, we're not talking about a personal website, where H_T_M_L_ slop is something that a web browser magically, automagically, I don't know how they're possibly able to do such an incredible job of rendering slop, but a web browser is in fact able to render slop, including A_I_ slop. It just finds a way. So really the question is: when you have production code, something that a lot of users are relying on, how do you review that code? How do you make sure you're catching the errors? How are you making sure that you put a backstop to the hallucinations and the logical errors that A_I_ coding agents can generate? Anyway, CodeRabbit supports all programming languages. Install the CodeRabbit C_L_I_ today at coderabbit.ai/lex. That's coderabbit.ai/lex. This episode is also brought to you by Element, my daily zero-sugar and delicious electrolyte mix.
It reminds me of the fact that I need to get to editing the video of me in the jungle with Paul Rosolie, who is such an incredible human. Congratulations to Paul on all of his success. Go get his book. It's an incredible book. Again, he's an incredible person with an incredible mission. And yes, I need to edit and publish, hoping to at the very least tell the story of our journey in the jungle, because it was a beautiful celebration of nature and the jungle and friendship and the full richness of the human experience. It was beautiful. The reason I mention that is, at one point during that journey I was severely dehydrated, and I remember dreaming of Element, of a cold drink of water with the electrolytes. Your body craves it, and it craves it because it needs it: sodium, potassium, magnesium. When you're deprived, it's not just water you need, it's electrolytes. So anyway, I always remember that. Get a free eight-count sample pack with any purchase. Try it at drinkLMNT.com/lex.
This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on.
And now, dear friends, here's Sebastian Raschka and Nathan Lambert.
So I think one useful lens to look at all of this through is the so-called DeepSeek moment. This happened about a year ago, in January twenty twenty five, when the open-weight Chinese company DeepSeek released DeepSeek R_ one, which, I think it's fair to say, surprised everyone with near or at state-of-the-art performance with allegedly much less compute, far cheaper. And from then to today, the A_I_ competition has gotten insane, both on the research level and the product level; it's just been accelerating. Let's discuss all this today, and maybe let's start with some spicy questions if we can. Who is winning at the international level? Would you say it's the set of companies in China or the set of companies in the United States? And Sebastian, Nathan, it's good to see you guys. So Sebastian, who do you think is winning?
So winning is a very broad, you know, term. You mentioned the DeepSeek moment, and I do think DeepSeek is definitely winning the hearts of the people who work on open-weight models, because they share these as open models. Winning, I think, has multiple time scales to it: we have today, we have next year, we have in ten years. One thing I know for sure is that I don't think nowadays, in twenty twenty six, there will be any company that has access to a technology that no other company has access to. And that is mainly because researchers are frequently changing jobs, changing labs; they rotate. So I don't think there will be a clear winner in terms of technology access. However, I do think the differentiating factor will be budget and hardware constraints. So I don't think the ideas will be proprietary, but rather the resources that are needed to implement them. So I don't currently see a winner-take-all scenario. I can't see that at the moment.
Uh Nathan, what do you think?
You see the labs put different energy into what they're trying to do. And to demarcate the point in time when we're recording this: the hype over Anthropic's Claude Opus four point five model has been absolutely insane. I've used it and built stuff with it in the last few weeks, and it's almost gotten to the point where it feels like a bit of a meme in terms of the hype, and it's kind of funny because this is very organic. And then if we go back a few weeks before that, for the release dates and the notes, Gemini three from Google got released, and it seemed like the marketing and just the wow factor of that release was super high. But then at the end of November, Claude Opus four point five was released and the hype has been growing, while Gemini three came before it, and it kind of feels like people don't really talk about it as much, even though when it came out everybody was saying this is Gemini's moment to retake Google's structural advantages in A_I_. And Gemini three is a fantastic model, and I still use it; it's just that the differentiation is lower. And I agree with Sebastian, with what you're saying: the idea space is very fluid. But culturally, Anthropic is known for betting very hard on code, and the Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and the culture of organisations, where Anthropic seems to at least be presenting as the least chaotic. That's a bit of an advantage if they can keep doing that for a while. But on the other side of things, there's a lot of ominous
technology from China, where there are way more labs than DeepSeek. So DeepSeek kicked off a movement within China, I'd say kind of similar to how chat G_P_T_ kicked off a movement in the U_S_ where everything had a chatbot. There are now tons of tech companies in China that are releasing very strong frontier open-weight models, to the point where I would say that DeepSeek is kind of losing its crown as the preeminent open model maker in China, and the likes of Z_ dot A_I_ with their G_L_M_ models, MiniMax's models, and Kimi's models have, especially in the last few months, shone more brightly. The new DeepSeek models are still very strong, but it could be looked back on as a big narrative point, where in twenty twenty five DeepSeek came and kind of provided this platform for way more Chinese companies to release these fantastic models and have this new type of operation. So these models from these Chinese companies are open-weight, and depending on the trajectory of the business models that these American companies are pursuing, those could be at risk. But currently, a lot of people are paying for A_I_ software in the U_S_, and historically, in China and other parts of the world, people don't pay a lot for software.
So some of these models, like DeepSeek, have the love of the people because they are open-weight. How long do you think the Chinese companies will keep releasing open-weight models?
I would say for a few years. I think that, like in the U_S_, there's not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it, so I get inbound from some of them. They're smart and realize the same constraints, which is that a lot of U_S_ tech companies and other I_T_ companies won't pay for an A_P_I_ subscription to Chinese companies due to security concerns. This has been a long-standing habit in tech, and the people at these companies then see open-weight models as an ability to influence and take a share of a huge, growing A_I_ expenditure market in the U_S_. And they're very realistic about this. And it's working for them, and I think the government will see that it is building a lot of influence internationally in terms of uptake of the technology. So there are going to be a lot of incentives to keep it going. But building these models and doing the research is very expensive, so at some point I expect consolidation. I don't expect that to be a story of twenty twenty six, though: there will be more open model builders throughout twenty twenty six than there were in twenty twenty five, and a lot of the notable ones will be in China.
You were gonna say something?
Yes, you mentioned DeepSeek losing its crown. I do think so to some extent, but we also have to consider that they are still, I would say, slightly ahead. It's not that DeepSeek got worse; it's just that the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi: the same architecture, and they're training it. And then again we have this leapfrogging, where one might at some point in time be a bit better because they have the more recent model. And I think this comes back to the fact that there won't be a clear winner, and it will just be like that: one releases something, the other one comes in, and the most recent model is probably always the best model.
Yeah. We'll also see that Chinese companies have different incentives. DeepSeek is very secretive, whereas some of these startups, like the MiniMaxes and Z_ dot A_I_s of the world, those two have literally filed I_P_O_ paperwork, and they're trying to get Western mind-share and do a lot of outreach there. So I don't know if these incentives will kind of change the model development, 'cause DeepSeek famously is built by a hedge fund, High-Flyer Capital, and we don't know what they use the models for or if they care about this. In terms of communication, they're not secretive about the technical reports that describe how their models work. They're still open on that front. And we should also say, on the Opus four point five hype, there's the layer of something being the darling of the X echo chamber, or Twitter echo chamber, versus the actual number of people that are using the model. I think it's probably fair to say that chat G_P_T_ and Gemini are focused on the broad user base that just wants to solve problems in their daily lives, and that user base is gigantic. So the hype about the coding may not be representative of the actual use.
I would say also, a lot of the usage patterns are, like you said, name recognition, brand, and stuff like that, but also muscle memory almost, where, you know, chat G_P_T_ has been around for a long time, people just got used to using it, and it's almost like a flywheel: they recommend it to other users. One interesting point is also the customisation of L_L_M_s. For example, chat G_P_T_ has a memory feature, right? And so you may have a subscription and you use it for personal stuff, but I don't know if you want to use that same thing at work, because there's a boundary between private and work. If you're working at a company, they might not allow that. Or you may not want that. And I think that's also an interesting point, where you might have multiple subscriptions. One is just clean: it has nothing of your personal images or hobby projects in there; it's just the work thing. And then the other one is your personal thing. So I think those are two different use cases, and it doesn't mean you only have to have one. I think the future is also multiple ones.
What model do you think won twenty twenty five? And what model do you think is gonna win twenty twenty six?
I think in the context of consumer chatbots, it's a question of: are you willing to bet on Gemini over chat G_P_T_? Which, I would say, in my gut feels like a bit of a risky bet, because open A_I_ has been the incumbent, and there are so many benefits to that in tech. I feel like in twenty twenty five the momentum was on Gemini's side, but they were starting from such a low point, I think, R_I_P_ Bard and those earlier attempts at getting started. I think huge credit to them for powering through the organisational chaos to make that happen. But also it's hard to bet against open A_I_, because they always come across as so chaotic, but they're very good at landing things. And personally, I have very mixed reviews of G_P_T_ five, but it had to have saved them so much money, with the headline feature being a router, where most users are no longer incurring as much G_P_U_ cost. So I think it's very hard to dissociate the things that I like out of models from the things that are gonna actually be a differentiator for the general public.
What do you think about twenty twenty six? Who's gonna win?
I'll say something, even though it's risky. I will say that I think Gemini will continue to make progress on chat G_P_T_. I think it comes down to Google's scale, when both of these are operating at such extreme scales, and Google has the ability to separate research and product a bit better, where you hear so much about open A_I_ being chaotic operationally and chasing the high-impact thing, which is a very start-up culture. And then on the software and enterprise side, I think Anthropic will continue to have success, as they've again and again been set up for that. Obviously Google's cloud has a lot of offerings, but I think this kind of Gemini name brand is important for them to build. And Google's cloud will continue to do well as well, but that's a more complex thing to explain in the ecosystem, because that's competing with the likes of Azure and A_W_S_ rather than on the model provider side.
So in the infrastructure you think T_P_U_s give an advantage?
Largely because the margin on NVIDIA chips is insane, and Google can develop everything from top to bottom to fit their stack and not have to pay that margin, and they've had a head start in building data centers. So for all of these things that have both high lead times and very hard margins at high cost, Google has just kind of a historical advantage there. And if there's gonna be a new paradigm, it's most likely to come from open A_I_, where their research division again and again has shown this ability to land a new research idea or a product. I think, like, deep research, Sora, O_ one, thinking models, all these definitional things have come from open A_I_, and that's gotta be one of their top traits as an organisation. So it's kind of hard to bet against that, but I think a lot of this year will be about scale and optimising what could be described as low-hanging fruit in models.
And clearly there's a trade-off between intelligence and speed. This was what chat G_P_T_ five was trying to solve behind the scenes. Do people, the broad public, actually want intelligence, or do they want speed?
I think it's a nice variety actually, or the option to have a toggle there. For my personal usage, most of the time when I look something up, I use chat G_P_T_ to ask a quick question and get the information I want fast. For most daily tasks I use the quick model. Nowadays I think the auto mode is pretty good, where you don't have to specifically say thinking or non-thinking and stuff. Then again, I also sometimes want the pro mode. Very often what I do when I have something written is put it into chat G_P_T_ and say, hey, do a thorough check: are all my references correct? Are all my thoughts correct? Did I make any formatting mistakes? Are the figure numbers wrong, or something like that? And I don't need that right away. It's something where, okay, I finish my stuff, maybe have dinner, let it run, come back, and it goes through this. And see, this is where I think it's important to have this option. I would go crazy if for each query I would have to wait thirty minutes, or even ten minutes. Yeah.
When people say they use the non-thinking model, I'm like, oh, how do you live with that? That's my reaction. I've been heavily on chat G_P_T_ for a while and never touched five non-thinking. I find its tone, and then its propensity for errors, it's just a high likelihood of errors. Some of this is from back when open A_I_ released O_ three, which was the first model to do this deep search and find many sources and integrate them for you. I became habituated to that, so I will only use G_P_T_ five point two thinking or pro when I'm running any sort of information query for work, whether that's a paper or some code reference that I found. I will regularly have like five pro queries going simultaneously, each looking for one specific paper or feedback on an equation or something.

I have a fun example where I just needed an answer as fast as possible, for this podcast, before I was going on a trip. I have a local G_P_U_ running at home, and I wanted to run a long R_L_ experiment. And usually I also unplug things, because you never know: if you're not at home, you don't wanna have things plugged in. And I accidentally unplugged the G_P_U_. My wife was already in the car, and it's like, oh dang, and then basically I wanted, as fast as possible, a bash script that runs my different experiments and evaluation. And I know, I learned how to use the bash terminal well, but in that moment it was just, in like ten seconds, give me the command.
This is a hilarious situation, but yes, what did you use?
So I used the non-thinking, fastest model. It gave me the bash command to chain the different scripts together, and then there's the tee thing, where you want to route the output to a log file. Off the top of my head, in a hurry, I could have thought about it myself.
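For reference, the kind of one-liner Sebastian describes, chaining experiment scripts and routing the combined output through tee into a log file, might look something like this sketch. The script names are placeholders, not his actual files; the echo lines stand in for real training and evaluation scripts.

```shell
#!/usr/bin/env bash
# Run several experiment steps in sequence, then evaluation, and mirror
# everything that is printed into a log file via tee.
set -euo pipefail

LOG="run.log"

{
  echo "running experiment 1"   # placeholder for e.g. `python train_run1.py`
  echo "running experiment 2"   # placeholder for e.g. `python train_run2.py`
  echo "running evaluation"     # placeholder for e.g. `python evaluate.py`
} 2>&1 | tee "$LOG"             # tee prints to the terminal AND writes the log
```

Plain `&&` chaining (`./a.sh && ./b.sh | tee run.log`) would also work, but there the pipe only captures the last command's output; grouping the commands with braces makes tee capture every step, stdout and stderr alike.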
By the way, I don't know if that's a representative case: wife waiting in the car, you have to run, you know, plug in the G_P_U_, you have to generate a bash script. This sounds like a movie. Like, you wish it were possible.
I use Gemini for that. So I use thinking for all the information stuff, and then Gemini for fast things, or stuff that I once would have Googled: it's good at explaining things, I trust that it has this kind of background knowledge, it's simple, and the Gemini app has gotten a lot better, so it's good for that sort of thing. And then for code and any sort of philosophical discussion, I use Claude Opus four point five, also always with extended thinking. Extended thinking and inference-time scaling are just a way to make the models marginally smarter, and I will always err on that side when the progress is very high, because you don't know when that'll unlock a new use case. And then I sometimes use Grok for real-time information, or finding something on A_I_ Twitter that I know I saw and need to dig up and have just fixated on. Although, when Grok four came out, Grok four Heavy, which was like their pro variant, was actually very good, and I was pretty impressed with it. But that was just kind of muscle memory; I lost track of it with having the chat G_P_T_ app open. So I use many different things.
Yeah, I actually do use Grok four Heavy for debugging, for like hardcore debugging when the other ones can't solve it. I find it's the best at that. And it's interesting, 'cause you say chat G_P_T_ is the best interface; for me, for that same reason, though this could be just momentum, Gemini is the better interface. I think because I fell in love with it being the best at needle-in-the-haystack: if I ever put in something that has a lot of context but can have very specific kinds of information, to make sure it tracks all of it, I find that Gemini, for me, has been the best. So it's funny with some of these models: if they win your heart over for one particular feature, on one particular day, for that particular query, that prompt, you're like, this model is better. And so you'll just stick with it for a bit until it does something really dumb. There's like a threshold effect: it does some smart thing and then you fall in love with it. And then it does some dumb thing, and you're like, you know what, I'm gonna switch, try Claude or chat G_P_T_, and all that kind of stuff.
this is exactly like you use it until it breaks until you have a problem and then then you ch uh change uh the L_M_ and I think i it's the same how we use anything like uh our favourite text editor um operating systems or m the browser I. mean there are so many browser options Safari, Firefox, Chrome, uh all the c relatively similar but then there are ex etch cases maybe extensions you wanna use and then you switch but, I don't think there is any w one who types the same thing like the website into different browsers and compares them, you only do
that when the website doesn't render, if something breaks, I think. So that's a good point: you use it until it breaks, and then you explore other options, I think.
On the long-context thing, I was also a Gemini user for this, but the G_P_T_ five point two release blog had like crazy long-context scores, where a lot of people were like, did they just figure out some algorithmic change? It went from like thirty percent to like seventy percent or something in this minor model update. So it's also very hard to keep track of all of these things, but now I look more favourably at G_P_T_ five point two's long context. So it's just kinda like, how do I actually get to testing this? It's a never-ending battle.
Well, it's interesting that none of us talked about the Chinese models from a usage perspective. What does that say? Does that mean the Chinese models are not as good, or does that mean we're just very biased uh and U_S_ focused?
I do think that that's currently the discrepancy between just the model and the platform. So I think the open models, they are more known for the open weights, not their platform yet.
Mm-hmm.
So these models from the U_S_ are better in terms of the outputs. I think the question is will they stay better for this year and for years going forward, but it's like, so long as they're better, I'm gonna pay to use them. I think there's also analysis that shows that like the
way that the Chinese models are served, which you could argue is due to export controls or not, is that they use fewer G_P_U_s per replica, which makes them slower and have different errors. And it's like speed and intelligence: if these things are in your favour as a user, I think in the U_S_ a lot of users will go for this, and I think that that is a good thing that will
spur these Chinese companies to want to compete in other ways, whether it's like free or substantially lower costs, or it'll breed creativity in terms of offerings, which is good for the ecosystem. But I just think of a simple thing: the U_S_ models are currently better and we use them. And I try these other open models and I'm like, fun, but I don't go back to it.
Uh we didn't really mention programming. That's another use case that a lot of people deeply care about. So I use basically half and half Cursor and Claude Code, because I find them to be like fundamentally different experiences and both useful. Uh what do you guys, you program quite a bit, so what do you use? What's the current vibe?
So I use the Codex plug-in for V_S_ Code. Uh you know, it's very convenient; it's just like a plug-in, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a bit more agentic: it touches more things, it does a whole project for you. I'm not quite there yet where I'm comfortable with that, because uh maybe I'm a control freak, but I still would like to see a bit of what's going on, and Codex is right now for me like the sweet spot, where it is helping me but it is not taking over completely.
I should mention, one of the reasons I do use Claude Code is to build the skill of programming with English. I mean, the experience is fundamentally different. As opposed to micromanaging the details of the process of the generation of the code, and uh looking at the diff, which you can in Cursor, if that's the I_D_E_ you use, and changing, altering, looking at and reading the code and understanding the code deeply as you progress, versus just kinda like
thinking in this design space and just guiding it at this uh macro level, which I think uh is another way of thinking about the programming process. Also, we should say that Claude Code just seems to be somehow a better utilisation of Claude Opus four point five.
It's a good side-by-side for people to do. You can have Claude Code open, you can have Cursor open, and you can have V_S_ Code open, and you can select the same models in all of them and ask questions that are very interesting. Like, Claude Code is way better in that domain; it's remarkable.
We should say that both of you are legit on multiple fronts: researchers, programmers, educators, tweeterers, and on the book front too. So Nathan, at some point soon hopefully, has an R_L_H_F_ book coming out.
It's available for pre-order, and there's a full digital pre-print; I'm just making it pretty and better organised for the physical thing, which is a lot of why I do it, because it's fun to create things that you think are excellent in physical form when so much of our life is digital.
I should say, going to Perplexity here: Sebastian Raschka is a machine learning researcher and author known for several influential books, a couple of which I wanted to mention and highly recommend: Build a Large Language Model From Scratch, and the new one, Build a Reasoning Model From Scratch. So I'm really excited about that. Building stuff from scratch is one of the most powerful ways of learning.
Honestly, building an L_L_M_ from scratch is a lot of fun. It's also a lot to learn, and like you said, it's probably the best way to learn how something really works, 'cause you can look at figures, but figures can have mistakes. Uh you can look at concept explanations, but you might misunderstand them. But if there is code and the code works, you know it's correct. Uh I mean, there's no misunderstanding; it's precise, otherwise it wouldn't work. And I think that's kind of like the beauty behind coding: it doesn't lie, it's math basically.
Even with math, though, I think you can have mistakes in a book that you would never notice, because you're not running the math when you are reading the book; you can't verify it. And with code, what's nice is you can verify it.
Yeah, I agree with you about the L_L_M_ From Scratch book. It's nice to tune out everything else, the internet and so on, and just focus on the book. But you know, I read several, like, you know, history books. It's just less lonely somehow. It's really more fun. Like, yeah, for example on the programming front, I think it's genuinely more fun to program with an L_L_M_, and I think it's genuinely more fun to read with an L_L_M_, but you're right that this distraction should be
minimised. So you use the L_L_M_ to basically enrich the experience, maybe add more context. The rate of aha moments for me on a small scale is really high with L_L_M_s.
Hundred percent. I also want to correct myself: I'm not suggesting not to use L_L_M_s. Uh I suggest doing it in multiple passes, like one pass just offline, focus mode, and then after that, uh I mean I also take notes, but I try to resist the urge to immediately look things up. I do a second pass; it's just more structured for me this way. I mean, sometimes things are answered in the chapter, but sometimes it also just helps to let it sink in and think about it. Other people have different preferences. I would highly
recommend using L_L_M_s when reading books. For me it's just not the first thing to do; it's the second pass.
By way of recommendation, as I say it, I do the opposite. I like to use the L_L_M_ at the beginning to lay out the full context of, like, what is this world that I'm now stepping into. But I try to avoid clicking out of the L_L_M_ into the world of, like, Twitter and blogs, because then you're down this rabbit hole: you're reading somebody's opinion, there's a flame war about a particular topic, and all of a sudden you're no longer, you're now in the realm of the internet and Reddit and so on.
If you're purely letting the L_L_M_ give you the context of why this matters, what the big-picture ideas are, uh but sometimes books themselves are good at doing that, but not always. So,
this is why I like the Chat G_P_T_ app, 'cause it gives the A_I_ a home in your computer where you can focus on it, rather than just being another tab in my mess of internet options. And I think Claude Code in particular does a good job of making that a joy, where it seems very engaging as a product, designed to be an interface where your A_I_ will then go out into the world. And something that is very intangible between it and Codex is that Claude Code just feels kind of warm and engaging, where Codex can often be as good
from Open A_I_, but it just kind of feels a little bit rougher on the edges, whereas Claude Code makes it fun to build things, particularly from scratch, where you just don't have to care, but you trust that it'll make something. Obviously this is good for websites and kind of refreshing tooling and stuff like this, which I'd use it for, or data analysis. So for my blog, we scrape Hugging Face, so we keep the download numbers for every data set and model over time now. So we have them, and Claude is just like, yeah, I've made use of that data, no problem. And all that would've
taken days, mostly. Then I have enough situational awareness to be like, okay, these trends obviously make sense, and you can check things. So it's just a kind of wonderful interface where you can have an intermediary and not have to do the kind of awful low-level work that you would have to do to maintain different web projects and do this stuff.
Alright, so we just talked about a bunch of the closed weight models. Let's talk about the open ones. Uh so tell me about the landscape of open weight L_L_M_s. Which are the interesting ones, which stand out to you, and why? We already mentioned DeepSeek.
Do you wanna see how many we can name off the top of our head?
Yeah yeah, without looking at notes.
DeepSeek, Kimi, MiniMax, Z_ dot A_I_, Ant's Ling. We're just going Chinese. Um let's go Mistral A_I_, Gemma, um G_P_T_ O_S_S_, the open-weight model by uh Open A_I_. Actually, NVIDIA had a very cool one, uh Nemotron three. Um there's a lot of stuff, uh especially at the end of the year. Qwen, maybe the most impressive. I was trying to see if you can get at least ten Chinese and at least ten Western. I think, I mean, Open A_I_ released their first open model since G_P_T_ two. When I was writing about Open A_I_'s open model release, they were all like, don't forget about G_P_T_ two, which I thought was really funny 'cause it's such a different time. But G_P_T_ O_S_S_ is actually a very strong model and does some things that the other models don't do very well. And I think that selfishly I'll promote a bunch of like Western companies, so both in the
U_S_ and in Europe, that have these like fully open models. So I work at the Allen Institute for A_I_. We've been building OLMo, which releases data and code and all of this. And now we have actual competition from people that are trying to release everything so that other people can train these models. So there's the Institute of Foundation Models, with L_L_M_ three sixty, which had their K_ two models of various types. Apertus is a Swiss research consortium effort. Hugging Face um has SmolLM, which is very popular. Um and NVIDIA's Nemotron has started
releasing data as well. And then there's Stanford's Marin community project, which is kind of making it so there's a pipeline for people to open a GitHub issue, implement a new idea, and then have it run in a stable language-modelling stack. So this space, that list was way smaller in twenty twenty four; I think it was like just A_I_ two. So that's a great thing for more people to get involved and to understand language models, and it doesn't really have a Chinese analogue. While I'm talking, I'll say that the Chinese open
language models tend to be much bigger, and that gives them this higher peak performance as M_O_E_s, where a lot of these things that we like a lot, whether it was Gemma um and Nemotron, have tended to be smaller models from the U_S_, which is starting to change for the U_S_ and Europe. Um Mistral Large three came out, which was a giant M_O_E_ model, very similar to the DeepSeek architecture, in December. And then a start-up, Arcee A_I_, and NVIDIA with Nemotron have teased M_O_E_ models way
bigger than a hundred billion parameters, like this four hundred billion parameter range, coming in this like Q_ one twenty twenty six timeline. So I think this kind of balance is set to change this year in terms of what people are using the Chinese versus U_S_ open models for, which is what I'm personally gonna be very excited to watch.
First of all, huge props for being able to name so many of these. Did you actually name Llama?
Um no.
Llama, R_I_P_.
Well, it's not on purpose. Alright, R_I_P_ Llama. Alright, can you mention what are some interesting models that stand out? So you mentioned Qwen three, that's obviously a standout.
So I would say the year is almost bookended by, on one hand, uh DeepSeek version three and R_ one, and then on the other hand, in December, uh DeepSeek version three point two, because what I like about those is they always have an interesting architecture tweak that others don't have. But otherwise, if you wanna go with, um, you know, the familiar but really good performance: uh Qwen three, and, like um Nathan said, also G_P_T_ O_S_S_. And I think what's interesting about G_P_T_ O_S_S_ is that it's kind of like the first public, or like open-weight, model that was really trained with
tool use in mind, which I do think is kind of a little bit of a paradigm shift, where the ecosystem was not quite ready for it. By tool use I mean that the L_L_M_ is able to do a web search, to call a Python interpreter. And I do think it's a standout because it's a huge unlock, because uh one of the most um common complaints about L_L_M_s is, for example, hallucinations, right. And so in my opinion, one of the best ways to solve uh hallucinations is to not always try to remember information or make things up. For
example, why not use a calculator app or Python? If I asked the L_L_M_ who won the, I don't know, soccer World Cup in nineteen ninety eight, uh instead of just trying to memorize it, it could go do a uh search. I think mostly it's usually still a Google search. So G_P_T_ O_S_S_ would do a tool call to Google, maybe find the FIFA website, find okay, it was France. It would get you that information reliably instead of just trying to memorize it. So I think it's a huge unlock, uh which I think right now is not fully
utilized by the open source, open weight ecosystem. A lot of people don't use tool-call modes because, I think, first it's a trust thing. You don't wanna run this on your computer where it has access to tools; it could wipe your hard drive or whatever. So you maybe wanna containerize that. Um but I do think, you know, that that is a really important step um for the upcoming years, to have this ability, you know.
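To make the tool-use loop described here concrete, below is a minimal sketch in Python. The tool name, the hard-coded "model", and the dispatch logic are all hypothetical stand-ins, not any real A_P_I_: real systems define their own tool-call schemas, and the L_L_M_ itself would emit the JSON that is hard-coded here.

```python
# A toy sketch of the tool-use loop: the model asks for a tool, the
# harness runs it, the result goes back to the model, which answers.
# Both `web_search` and `fake_model` are invented for illustration.

def web_search(query):
    # Stand-in for a real search A_P_I_ call.
    facts = {"1998 soccer World Cup winner": "France"}
    return facts.get(query, "no result")

TOOLS = {"web_search": web_search}

def fake_model(prompt, observation=None):
    # A real L_L_M_ would generate these structured steps itself; here
    # we hard-code two turns: request a tool, then answer from its result.
    if observation is None:
        return {"tool": "web_search",
                "args": {"query": "1998 soccer World Cup winner"}}
    return {"answer": f"The 1998 World Cup was won by {observation}."}

def run(prompt):
    step = fake_model(prompt)
    while "tool" in step:                       # model asked for a tool call
        result = TOOLS[step["tool"]](**step["args"])
        step = fake_model(prompt, observation=result)
    return step["answer"]

print(run("Who won the soccer World Cup in 1998?"))
```

The point is only the shape of the loop: generate, detect a tool request, execute, feed the observation back, repeat until the model produces a final answer instead of another tool call.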
So uh a few quick things. First of all, thank you for defining what you mean by tool use. I think that's a great thing to do in general for the concepts we're talking about. Even things as well established as M_O_E_s, uh you have to say that means mixture of experts. You kind of have to build up an intuition for people: what that means, how it's actually utilized, what are the different flavors. So what does it mean that there's such an explosion of open models? What's your intuition?
With an open model, you want people to use it as the first and foremost thing, and then after that come things like transparency and trust. I think when you look at China, the biggest reason is that they want people around the world to use these models. If you look outside of the U_S_, a lot of people will not pay for software, but they might have computing resources where you can put a model on it and run it. There can also be data that you don't want to send to the cloud. So the number one thing is getting people to use models, use A_I_, or use your A_I_, which they might not be able to do
without having access to the model. I guess we should state explicitly: we've been talking about these Chinese models, and open weight models are oftentimes run locally. So it's not like you're sending your data to China, or to Silicon Valley, or whoever developed the model.
A lot of American startups make money by hosting these models from China and selling them, selling tokens, which means somebody will call the model to do some piece of work. I think the other reason, for U_S_ companies like Open A_I_, is that they're G_P_U_-deprived; they're at the limits of their G_P_U_s. Whenever they make a release they're always talking about, oh, our G_P_U_s are hurting. And I think in one of these G_P_T_ O_S_S_ release sessions Sam Altman said, like, oh, we're releasing this because you can use your
G_P_U_s. We don't have to use our G_P_U_s, and Open A_I_ could still get distribution out of this, which is another very real thing 'cause it doesn't cost them anything.
And for the user, I think, I mean, there are users who just use the model locally how they would use uh G_P_T_. But also for companies, uh I think it's a huge unlock to have these models because you can customise them, you can train them, you can add post-training, add more data, specialise them into, let's say, law or medical models, whatever you have. And, you mentioned Llama, the appeal of the open weight models from China is that the licenses are even friendlier. I think they are just unrestricted open source,
whereas if we use something like uh Llama or Gemma, there are some strings attached. I think it's like an upper limit in terms of how many users you have, and then if you exceed, I don't know, so many million users, you have to report your financial situation to, let's say, Meta or something like that. And I think, well, it is a free model, but there are strings attached, and people do like things where strings are not attached. So I think that's also one of the reasons, uh besides performance, why the open weight models from China are so popular: you can just use them
and there's no catch in that sense, yeah.
The ecosystem has gotten better on that front, but mostly downstream of these new providers providing such open licenses. That was funny when you pulled up Perplexity and it said Kimi K_ two Thinking, hosted in the U_S_, which is just, I've never seen this, but it's an exact example of what we're talking about, where people are sensitive to this. So like Kimi K_ two Thinking and Kimi K_ two is a model that is very popular; people say that it has very good, like, creative writing, and is also good at doing some software things. So it's just these little quirks that people pick up on with different models that they like.
Uh what are some interesting ideas that some of these models have explored that you can speak to, that are particularly interesting to you?
Maybe we can go chronologically. I mean, there was of course um DeepSeek R_ one that came out in January, if we just focus on two thousand twenty five. However, this was based on DeepSeek version three, which came out the year um before, in December two thousand twenty four. Uh there are multiple things on the architecture side. What is fascinating is, you can still, I mean, that's what I do in my from-scratch coding projects, you can still start with G_P_T_ two and add things to that model to make it into this other model. So it's all still kind of like the same lineage; there is a very
direct uh relationship between those. But uh, off the top of my head, DeepSeek, what was uh unique there is the mixture of experts, uh not, I mean, they were not inventing mixture of experts; we can maybe talk a bit more about what mixture of experts means. Um but just to list these things first before we dive into detail: mixture of experts, but then they also had multi-head latent attention, which is a tweak to the attention mechanism. This was, I would say, in two thousand twenty five the main distinguishing factor between these uh open weight
models: everyone had different tweaks to make inference cheaper or the K_V_ cache size smaller. We can also define the K_V_ cache in a few moments. But to make it more economical to have long context, you shrink the K_V_ cache size. So what are the tweaks um that we can do? Most of them focused on the attention mechanism. There is multi-head latent attention in DeepSeek. There is uh grouped query attention, which is still very popular; it's not invented by any of those models, it goes back a few years, but that would be the other option. Sliding window attention, I think
Gemma three uses it, um if I remember correctly. So there are these different tweaks that make the models different. Otherwise, um I put them all together in articles where um I just compare them; they are surprisingly similar. It's just different numbers in terms of how many repetitions of the transformer block you have in the centre, uh and like mm just little knobs that people tune. But what's so nice about it is it works no matter what. You can tweak things, you can move the normalisation layers around,
and get some performance gains, and there are almost never very good evaluation studies showing what it actually does to the model if you move something around, whether it makes it better or worse. But there are so many, let's say, ways you can implement a transformer and make it still work. Big ideas um that are still prevalent: mixture of experts, uh multi-head latent attention, um sliding window attention, grouped query attention. And then at the end of the year we saw a focus on making the attention mechanism scale linearly with the number of inference tokens.
So there was Qwen three Next, for example, which added a gated DeltaNet. It's um kind of like inspired by um state space models, where you have a fixed state that you keep updating, but it essentially makes this attention cheaper, or replaces attention with a cheaper operation.
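A rough sketch of that fixed-state idea, in NumPy. This is plain (ungated) linear attention, not the actual gated DeltaNet from Qwen three Next: it only illustrates why the per-token cost stays constant instead of growing with context length, as it does with a K_V_ cache.

```python
import numpy as np

# Simplified (ungated) linear attention: instead of attending over a
# growing K_V_ cache, keep a fixed-size state S = sum_t outer(k_t, v_t)
# and update it once per token. The gating and delta-rule refinements
# of the real layer are omitted; head size and inputs are made up.

d = 4                                  # head dimension
rng = np.random.default_rng(0)
S = np.zeros((d, d))                   # fixed-size recurrent state

for t in range(10):                    # stream of tokens
    k, v = rng.normal(size=d), rng.normal(size=d)
    S += np.outer(k, v)                # state update: O(d^2), independent of t
    q = rng.normal(size=d)
    out = q @ S                        # read-out: also O(d^2) per token

print(S.shape)                         # state never grows, unlike a K_V_ cache
```

With softmax attention the tenth token would attend over ten cached key/value pairs; here every token touches the same 4-by-4 state, which is the "scale linearly with inference tokens" property just mentioned.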
And maybe is it useful to step back and talk about the transformer architecture in general?
Yeah, so maybe we should start with the G_P_T_ two architecture, the transformer that was derived from the Attention Is All You Need paper. Uh so the Attention Is All You Need paper had a transformer architecture with two parts, an encoder and a decoder, and G_P_T_ went just focusing in on the decoder part. It is essentially still a neural network, um and it has this attention mechanism inside. And you predict one token at a time; you pass it through an embedding layer, there's the
transformer block. The transformer block has attention modules and a fully connected layer, and there are some normalization layers in between. But it's essentially neural network layers with this attention mechanism. So coming from G_P_T_ two, uh when we move on to G_P_T_ O_S_S_, there is for example the mixture of experts um layer. It's not invented by G_P_T_ O_S_S_; it's a few years old. Um but it is essentially a tweak to make the model larger without consuming more compute in each forward pass.
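The decoder block just described can be sketched in a few lines of NumPy: causal self-attention followed by a fully connected layer, with residual connections. Multiple heads and the normalisation layers are omitted, and all sizes and weights are made up for illustration, so this is the shape of the computation, not any particular model's block.

```python
import numpy as np

# A stripped-down G_P_T_-style decoder block: causal self-attention,
# then a feed-forward (fully connected) layer, each with a residual.

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_block(x, Wq, Wk, Wv, W1, W2):
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -1e9     # causal mask: no peeking ahead
    x = x + softmax(scores) @ v                # attention + residual
    x = x + np.maximum(x @ W1, 0) @ W2         # feed-forward (ReLU) + residual
    return x

rng = np.random.default_rng(0)
d, T = 8, 5                                    # toy embedding dim, 5 tokens
params = [rng.normal(size=s) * 0.1
          for s in [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]]
out = decoder_block(rng.normal(size=(T, d)), *params)
print(out.shape)                               # one output vector per token
```

Stacking this block many times, plus the embedding layer at the bottom and an output projection at the top, is essentially the whole G_P_T_ two architecture.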
There is this uh fully connected layer, and if listeners are familiar with um multi-layer perceptrons, you can think of a mini multi-layer perceptron, a fully connected neural network layer, inside the transformer. And it's very expensive because it's fully connected: if you have a thousand inputs and a thousand outputs, that's like one million connections, and it's a very expensive part of this transformer. And the idea is to kind of expand that into multiple feed-forward networks. So instead of having one, let's say you have two hundred fifty-six.
But that would make it way more expensive, because now we have two hundred fifty-six of them. But you don't use all of them at the same time. So you now have a router that says, okay, based on this input token, it would be useful to use this um fully connected network. And in that context it's called an expert. So a mixture of experts means we have multiple experts, and depending on what your input is, uh let's say it's more math heavy, it would use different experts compared to, let's say, translating input text from English to Spanish; that would maybe consult different experts.
It's not quite as clear cut as saying, okay, this is only an expert for math and this one for Spanish; it's a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, but not all the knowledge is used all the time; that would be very wasteful. So during token generation you are more selective: there's a router that selects which tokens should go to which expert. It's more complexity, it's harder to train, there's a lot that can go wrong, like router collapse and everything. So I think
that's why OLMo three still uses uh dense. I mean, there are OLMo models with mixture of experts, but dense models, uh where dense means, uh so also it's jargon: there's a distinction between dense and sparse. Mixture of experts is considered sparse because we have a lot of experts but only a few of them are active; that's called sparse. And dense would be the opposite, where you only have one fully connected module and it's always, you know, utilised.
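A toy version of the sparse routing just described, in NumPy: a router scores every expert per token, and only the top-k experts actually run. The sizes, the number of experts, and the softmax-over-chosen-experts weighting are illustrative choices, not any particular model's recipe.

```python
import numpy as np

# Toy mixture-of-experts layer. Each "expert" is a small feed-forward
# network (W1, W2); a router picks top_k experts per token, so most
# experts stay idle for any given token -- that's the "sparse" part.

def moe_layer(x, router_W, experts, top_k=2):
    outputs = np.zeros_like(x)
    for i, token in enumerate(x):
        logits = token @ router_W                 # one score per expert
        chosen = np.argsort(logits)[-top_k:]      # top-k experts for this token
        weights = np.exp(logits[chosen])
        weights /= weights.sum()                  # softmax over chosen experts
        for w, e in zip(weights, chosen):
            W1, W2 = experts[e]
            outputs[i] += w * (np.maximum(token @ W1, 0) @ W2)
    return outputs

rng = np.random.default_rng(0)
d, n_experts, T = 8, 4, 3                         # made-up sizes
router_W = rng.normal(size=(d, n_experts))
experts = [(rng.normal(size=(d, 2 * d)), rng.normal(size=(2 * d, d)))
           for _ in range(n_experts)]
out = moe_layer(rng.normal(size=(T, d)), router_W, experts)
print(out.shape)
```

With 4 experts and top_k equal to 2, each token pays for only half the expert parameters per forward pass, which is exactly the "larger model without more compute per pass" trade described above.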
So maybe it's a good place to also talk about the K_V_ cache. But actually, before that, even zooming out: fundamentally, how many new ideas have been implemented from G_P_T_ two to today? Like, how different really are these architectures?
Mm-hmm.
Replacing layer norm by R_M_S_ norm, but it's just a different normalization layer and not a big change; it's just a tweak. Um the non-linear activation function, um for people familiar with deep neural networks, I mean, it's the same as changing sigmoid to ReLU; it's not changing the network uh fundamentally, it's just a little tweak. Um and that's about it, I would say. It's not really fundamentally that different; it's still the same architecture, so you can go from one into the other by just adding these
changes basically. Mm-hmm. Yep. So for example, you mentioned my book earlier; that's a G_P_T_ two model in the book, because it's simple and it's very small, um one hundred twenty-four million parameters approximately. But in the bonus materials I do have OLMo three from scratch, Gemma three from scratch, and other types of from-scratch models. And I always start with my G_P_T_ two model and just, you know, edit the different components, and you get from one to the other. It's kind of like a lineage, in a sense, yeah.
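The LayerNorm-to-R_M_S_-norm swap mentioned a moment ago is small enough to show directly. A sketch in NumPy, with the learnable scale and shift parameters that real models have omitted for brevity:

```python
import numpy as np

# LayerNorm centres the activations (subtracts the mean) and rescales
# by the standard deviation; R_M_S_ norm skips the centring and divides
# by the root mean square only -- one less reduction per call.

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms                      # no mean subtraction

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x).round(2))           # zero-mean output
print(rms_norm(x).round(2))             # same signs as input, rescaled
```

Same inputs, same shapes, slightly different statistics: exactly the kind of "little tweak" being described, which is why swapping one for the other doesn't change the architecture in any fundamental way.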
But give intuition for people, because uh when you zoom out and look at it, there's so much rapid advancement in the A_I_ world, and at the same time fundamentally the architectures have not changed. So where is all the turbulence, the turmoil of the advancement happening? Where are the gains to be had?
You have the pre-training. Now, um back then it was just pre-training with G_P_T_ two. Now you have pre-training, mid-training and post-training. Um so I think right now we are in the post-training focus stage. I mean, pre-training still gives you um advantages if you scale it up with better, higher quality data. But then we have capability unlocks that were not there with G_P_T_ two. For example, uh Chat G_P_T_: it is basically a G_P_T_ three model, and G_P_T_ three is the same as G_P_T_ two in terms of
architecture. What was new was adding the um supervised fine-tuning and the reinforcement learning from human feedback. So it's more on the algorithmic side rather than the architecture.
I would say that the systems also change a lot. I think if you listen to NVIDIA's announcements, they talk about these things: you can now do F_P_ eight, you can now do F_P_ four. And what is happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster, and that lets them put more data in, and then you can find better configurations faster by doing this.
You can look at, like, essentially the tokens per second per G_P_U_; that's a metric that you look at when you're doing large scale training. And you can go from like ten K_ to thirteen K_ by turning on F_P_ eight training, which means you're using less memory per parameter in the model. And by saving less information you do less communication, so you can train faster. So all of these system things underpin way faster experimentation on data and algorithms, and it's this kind of thing
where it's kind of hard to describe: when you look at the architectures, they're exactly the same, but the code base used to train these models is gonna be vastly different. And, I mean, the G_P_U_s are different, but you could probably train G_P_T_ O_S_S_ twenty B_ way faster in wall clock time than G_P_T_ two was trained at the time.
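The memory side of that F_P_ eight point can be shown with back-of-the-envelope arithmetic; the model size below is made up, and only the bytes-per-value math is real:

```python
# Bytes per parameter: B_F_ sixteen is 16 bits (2 bytes), F_P_ eight
# is 8 bits (1 byte). The 20B parameter count is a hypothetical
# example, not any specific model's exact size.

params = 20e9                     # hypothetical 20B-parameter model
bytes_bf16, bytes_fp8 = 2, 1

print(f"BF16 weights: {params * bytes_bf16 / 1e9:.0f} GB")   # 40 GB
print(f"FP8  weights: {params * bytes_fp8 / 1e9:.0f} GB")    # 20 GB
```

Halving the bytes per value roughly halves both the memory footprint and the data moved between G_P_U_s, which is where throughput jumps like the ten K_ to thirteen K_ tokens per second per G_P_U_ mentioned above come from.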
For the speed, this is true, but uh it doesn't give the model new capabilities in a sense. It's just: how much can we make the computation coarser without suffering model performance degradation? Um but I do think, I mean, there are alternatives popping up to the transformer. There's text diffusion models, a completely different paradigm, um and, though text diffusion models might use transformer architectures, it's not an autoregressive um transformer. And also Mamba models; uh it's a
state space model. But they do have trade-offs, and what's right is that nothing has replaced the um autoregressive transformer as the state of the art model. So for state of the art, you would still go with that thing. But there are now alternatives for the cheaper end, like alternatives that are kind of um making compromises. But it's not just one architecture anymore; there are little ones coming up. But if we talk about the state of the art, it's pretty much still the autoregressive transformer architecture.
Derived from G_P_T_ two essentially.
I guess the big question here is: we talked quite a bit here about the architecture behind the pre-training. Are the scaling laws holding strong across pre-training, post-training, inference, context size, data, synthetic data?
I like to start with the technical definition of a scaling law, which kind of informs all of this. The scaling law is a power law relationship where you could think of the X_ axis, so kind of what you are scaling, as a combination of compute and data, which are kind of similar, and then the Y_ axis is like the held-out prediction accuracy over the next token. So we talk about models being autoregressive; it's like, if you keep a set of text that the model has not seen, how accurate does it get as you train. And the idea of scaling laws came when
people figured out that that was a very predictable relationship, and I think that that technical trend is continuing. And then the question is, what do users get out of it? And then there are more types of scaling, where um Open A_I_'s O_ one was famous for introducing inference-time scaling, and I think less famously for also showing that you can scale reinforcement learning training and get kind of this log X_ axis and then a linear increase in performance on the Y_ axis. So there's kind of these three axes now, where the traditional
scaling laws are talked about for pre-training, which is how big your model is and how big your data set is. And then scaling reinforcement learning, which is like how long you can do this trial-and-error learning, which we will talk about and define more. And then this inference-time compute, which is just letting the model generate more tokens on a specific problem. So I'm kind of bullish: they're all really still working, but the low hanging fruit has mostly been taken, especially in the last year, on um reinforcement learning with verifiable rewards, which is this R_L_V_R_, and then inference-time scaling,
Which is why these models feel so different to use, where previously you would get that first token immediately, and now they'll go off for seconds, minutes, or even hours generating these hidden thoughts before giving you the first word of your answer. That's all about this inference-time scaling, which is such a wonderful kind of step function in how the models' abilities changed. It kind of enabled this tool-use stuff and enabled this much better software engineering that we were talking about. And when we say enabled, it's almost entirely downstream of the fact that this reinforcement learning with verifiable rewards training just let the models pick up these skills very easily. So let the models learn. If you look at the reasoning process when the models are generating a lot of tokens, what it'll often be doing is: it tries a tool, it looks at what it gets back, it tries another A_P_I_, it sees what it gets back and whether it solves the problem. So the models, when you're training them, very quickly learn to do this, and at the end of the day that gives this general foundation where the model can use C_L_I_ commands very nicely in your repo and handle git for you and move things around and organise things, or search to find more information, which, if we were sitting in these chairs a year ago, is something we didn't really imagine the models doing. So this is just something that has happened this year and has totally transformed how we think of using A_I_, which I think is very magical. It's such an interesting evolution and just unlocks so much value. But it's not clear what the next avenue will be in terms of unlocking stuff like this. We'll get to continual learning later, but there's a lot of buzz around certain areas of A_I_, and no one knows when the next step function will really come.
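As a concrete illustration of the power-law definition above, here's a toy sketch in Python. All constants and data points are invented for illustration, not taken from any real model; the point is just that a power law is a straight line in log-log space, which is what makes scaling so predictable:

```python
import math

# Hypothetical (compute, held-out loss) pairs that follow a power law
# L(C) = a * C^(-b); the constants here are made up for illustration.
a_true, b_true = 10.0, 0.05
points = [(10.0**e, a_true * (10.0**e) ** (-b_true)) for e in range(18, 26)]

# A power law is a straight line in log-log space:
# log L = log a - b * log C, so ordinary least squares recovers a and b.
xs = [math.log(c) for c, _ in points]
ys = [math.log(l) for _, l in points]
n = len(points)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
b_fit = -slope
a_fit = math.exp(y_mean - slope * x_mean)

# Extrapolate: predicted loss if we scale compute 10x past the data.
predicted = a_fit * (10.0**26) ** (-b_fit)
print(b_fit, predicted)
```

Because the fitted line extrapolates, labs can predict the loss of a run they haven't done yet, which is exactly the "very predictable relationship" described above.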
So you've actually said quite a lot of things there, and said profound things quickly. It would be nice to unpack them a little bit. You say you're bullish basically on every version of scaling. So can we just start at the beginning: pre-training. Are we implying that the low-hanging fruit on pre-training scaling has been picked? Has pre-training hit a plateau, or are you still bullish even on pre-training?
Pre-training has gotten extremely expensive. To scale up pre-training also implies that you're gonna serve a very large model to users. I think it's been loosely established that the likes of G_P_T_ four and similar models were around one trillion parameters, this order of a trillion parameters at the biggest size. There are a lot of rumors that they've actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your costs of serving go down proportionately. For these models, the cost of training them is really low relative to the cost of serving them to hundreds of millions of users. DeepSeek had this famous number of about five million dollars for pre-training at cloud market rates. I think for Olmo three, section two point four in the paper details how long we had the G_P_U_ clusters sitting around for training, which includes engineering issues and multiple seeds, and it was about two million dollars to rent the cluster and deal with all the problems and headaches of training a model. So these models are pretty accessible; a lot of people could get a couple million dollars to train a model. But the recurring cost of serving millions of users is really billions of dollars of compute. You can look at what a thousand-G_P_U_ rental costs, you can pay a hundred grand a day for it, and these companies could have millions of G_P_Us. You can look at how much these things cost just to sit around. So that's kind of a big thing, and then it's like, if scaling is actually giving you a better model, is it gonna be financially worth it? I think it'll slowly push outward as A_I_ solves more compelling tasks, like the likes of Claude Opus four point five making Claude Code just work for things. I launched this project called the A_T_O_M_ Project, which is about American truly open models, in July. And that was a true vibe-coded website, and I had it make plots and stuff. Then I came back to refresh it in the last few weeks, and Claude Opus four point five, versus whatever the model at the time was, just crushed all the issues that it had from building in June and July. It might be a bigger model; there are other things that go into this, but there's still progress coming.
So what you're speaking to is the nuance of the Y_ axis of the scaling laws, that the way it's experienced versus on a benchmark, the actual intelligence, might be different. But still, your intuition about pre-training: if you scale the size of compute, will the models get better? Not whether it's financially viable, but just from the law aspect of it, do you think the models will get smarter?
Yeah. And this sometimes comes off as almost delusional when leadership at A_I_ companies says it, but it's held for thirteen orders of magnitude of compute, so why would it ever end? I think fundamentally it is pretty unlikely to stop. It's just that eventually we're not even gonna be able to test the bigger scales because of all the problems that come with more compute. There's a lot of talk about how twenty twenty-six is the year when very large Blackwell compute, gigawatt-scale facilities from hyperscalers, comes online. These were all contracts for power and data centres that were signed and sought out in twenty twenty-two and twenty twenty-three, so before or right after Chat G_P_T_. It took this two-to-three-year lead time to build these bigger clusters to train the models. There's obviously immense interest in building even more data centres than that. So that is the crux of what people are saying: these new clusters are coming, the labs are gonna have more compute for training, and they're going to utilise it. But it's not a given. I've seen so much progress that I expect it, and I expect a little bit bigger models. I would say it's more like we will see a two-thousand-dollar subscription this year; we already see two-hundred-dollar subscriptions, and that can ten-X again. These are the kinds of things that could come, and they're all downstream of this slightly bigger model that offers just a little more cutting edge.
So it's reported that X_A_I_ is gonna hit that one-gigawatt scale in early twenty twenty-six and a full two gigawatts by year end. How do you think they'll utilise that in the context of scaling laws? Is a lot of that inference? Is a lot of that training?
It ends up being all of the above. All of your decisions when you're training a model come back to pre-training. If you're gonna scale R_L_ in a model, you still need to decide on an architecture that enables it. We were talking about other architectures and using different types of attention. We're also talking about mixture-of-experts models; the sparse nature of M_O_E_ models makes it much more efficient to do generation, which becomes a big part of post-training. So you need to have your architecture ready so that you can actually scale up this compute. I still think most of the compute is going in at pre-training, because you can still make a model better, you still want to go and revisit this, and you still want the best base model that you can get. In a few years that'll saturate, and the R_L_ compute will just go longer.
There are people who disagree with you, who say basically pre-training is dead, it's all about scaling inference, scaling post-training, scaling context, continual learning, scaling data, synthetic data.
People vibe that way and describe it in that way, but I think it's not the practice that is happening.
this thing's dead.
Yeah.
Yes.
Mm-hmm. So reasoning is
If you get a new compute cluster, it lets you do something maybe more stably or faster, 'cause you hear a lot about Blackwell having roll-out issues. At A_I_ two, most of the models were pre-trained on around one to two thousand G_P_Us, but when you're pre-training on ten thousand or a hundred thousand G_P_Us you hit very different failures. G_P_Us are known to break in weird ways, and doing a hundred-thousand-G_P_U_ run, you're pretty much guaranteed to always have at least one G_P_U_ that is down, and you need to have your training code handle that redundancy. That's a very different problem from, say, me playing with post-training on a D_G_X_ Spark, or your book for people learning M_L_. What they're battling to train these biggest models is massive distributed scale. That's a systems problem: in order to enable the scaling laws, especially of pre-training, you need all these G_P_Us at once. When we shift to reinforcement learning, it actually lends itself to heterogeneous compute, because you have
many copies of the model. To do a primer on reinforcement learning for language models: you have two sets of G_P_Us. One you can call the actor, and one you call the learner. The learner is where your actual reinforcement learning updates are gonna happen. These are traditionally policy-gradient algorithms; proximal policy optimisation, P_P_O_, and group relative policy optimisation, G_R_P_O_, are the two popular classes. On the other side you're gonna have actors, which are generating completions, and these completions are the things that you're gonna grade. Reinforcement learning is all about optimising reward. In practice, you can have a lot of different actors in different parts of the world doing different types of problems, and then you send it back to this highly networked compute cluster to do the actual learning, where you take the gradients, and you need a tightly meshed network where you can do different types of parallelism and spread out your model for efficient training.
Every different type of training and serving has these considerations as you scale. We talked about pre-training, we talked about R_L_, and then inference-time scaling is: how do you serve a model that's thinking for an hour to a hundred million users? I don't really know about that, but I know it's a hard problem, and in order to give people this intelligence there are all these systems problems; we need more compute, and you need more stable compute to do it.
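A minimal sketch of the group-relative idea behind G_R_P_O_, mentioned in the actor/learner description above. This is a deliberate simplification: real implementations also handle the policy-gradient update itself, clipping, and K_L_ penalties; this only shows how each completion's reward is normalised against its group, which is what removes the need for a learned value function:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each completion's reward is
    normalised against the other completions for the same prompt, so no
    learned value function (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled completions graded by a verifiable reward
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
```

The actors generate the completions and rewards; the learner turns these advantages into gradient updates. Note that if every completion in the group gets the same reward, all advantages are zero and the prompt contributes no learning signal.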
But you're bullish on all of these kinds of scaling is what I'm hearing. On the inference, on the reasoning, even on the pre-training.
Yeah, so that's a big can of worms. There are basically two knobs where you can get gains: training and inference scaling. In a world where we had, let's say, infinite compute resources, you'd wanna do all of them. So you have training and you have inference scaling, and training is a hierarchy: pre-training, mid-training, post-training. Changing the model size, more training data, training a bigger model gives you more knowledge in the model; the model is, let's say, a better base model, or, back in the day and still, we call it a foundation model. But you don't, let's say, have the model be able to solve your most complex tasks during or right after pre-training. You still have these other unlock phases, where you have mid-training, for long context for example, or post-training with R_L_V_R_, that unlock capabilities the model has in terms of knowledge from the pre-training. And sure, if you do more pre-training, you get a better base model that you can unlock later, but like Nathan said, it just becomes too expensive. We don't have infinite compute, so you have to decide: do I want to spend that compute on making the model larger? It's a trade-off. In an ideal world you want to do all of them, and in that sense, scaling is still pretty much alive; you would still get a better model. But like we saw with G_P_T_ four point five, it's just not worth it, because you can, let's say, unlock more performance with other techniques at that moment. Especially if you look at inference scaling, that's one of the biggest gains this year,
where it took a smaller model further than pre-training a larger model like G_P_T_ four point five did. So I wouldn't say pre-training scaling is dead; there are just other, more attractive ways to scale right now, at the moment. At some point you will still wanna make progress on the pre-training. The thing to consider is also where and why you wanna spend your money. If you spend it more on the pre-training, it's a fixed cost: you train the model, and then it has this capability forever; you can always serve another user and so forth. With inference scaling, you don't spend money during training, you spend money later, per query, and then it's also the math of how long your model is gonna be on the market. If I replace it in half a year, maybe it's not worth spending five million, ten million, a hundred million dollars on training it longer. Maybe I will just do more inference scaling and get the performance from there; maybe it costs me two million in terms of user queries. It becomes a question of how many users you have, and then doing the math. And I think that's also where it's interesting: Open A_I_ is in a position, I think, where they have a lot of users and they need to go a bit cheaper, where they have that G_P_T_ five model that is a bit smaller. Other companies' customers have other trade-offs. For example, there was also the math Olympiad, or some of these math problems, where Open A_I_ maybe had a proprietary model, and I'm pretty sure it's just a model that has maybe been fine-tuned a little bit more, but most of it was doing inference scaling to achieve peak performance in certain tasks, where you don't need that all the time. But yeah, long story short, I do think all of these, pre-training, mid-training, post-training, inference scaling, are all still things you wanna do. It's just that at the moment, in this year, it's about finding the right ratio that gives you the best bang for the buck, basically.
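The fixed-cost-versus-per-query trade-off just described can be sketched as a break-even calculation. Every dollar figure below is invented purely for illustration; the structure of the comparison is the point, not the numbers:

```python
# Toy version of the "fixed training cost vs per-query cost" math.
# All dollar figures are invented for illustration.
extra_pretrain_cost = 50e6   # extra one-off spend to train a bigger model
per_query_big = 0.002        # bigger model answers directly, fewer tokens
per_query_small = 0.005      # smaller model "thinks" longer on each query

def total_cost(fixed, per_query, n_queries):
    # lifetime cost of an option: one-off training plus serving
    return fixed + per_query * n_queries

# Past this many queries, paying for the bigger model up front wins.
break_even = extra_pretrain_cost / (per_query_small - per_query_big)
print(f"{break_even:,.0f} queries")
```

With these made-up numbers the bigger model only pays off after tens of billions of queries, which is why the answer depends so heavily on how many users you have and how long the model stays on the market.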
I think this might be a good place to define pre-training, mid-training, and post-training.
So pre-training is the classic training of next-token prediction, one token at a time. You have a big corpus of data, and Nathan also has very interesting insights there because of Olmo three; a big portion of the paper focuses on the right data mix. So pre-training is essentially just training a cross-entropy loss on next-token prediction over a vast corpus of internet data, books, papers, and so forth. It has changed a little bit over the years, in the sense that people used to throw in everything they could. Now it's not just raw data, it's also synthetic data, where people, let's say, rephrase certain things. So synthetic data doesn't necessarily mean purely A_I_-made-up data; it's also taking something from an article, a Wikipedia article, and then rephrasing it as a Q_ and A_ question, or summarising it, rewriting it, and making better data that way. Because I think of it also like with humans: if someone, let's say, reads a book compared to, I dunno, Reddit posts or something like that, no offence, but I think
There's gonna be a post about this.
Reddit data is very coveted and excellent for training; you just have to filter it. I think that's the idea. But if someone took that and rephrased it in a, let's say, more concise and structured way, I think it's higher-quality data. You maybe get the same L_L_M_ out of it at the end, but it gets there faster; it trains faster, because if the grammar and the punctuation are correct, it already learns the correct way, versus getting information in a messy way and then learning later how to correct that. So that is how pre-training evolved, and why scaling still works: it's not just about the amount of data, it's also the tricks to make that data better for you, in a sense. And then mid-training: I think it's called mid-training because it was awkward to have pre-training and post-training but nothing in the middle. It sounds a bit weird, you have pre-training and post-training, but what's
the actual training? So mid-training is usually similar to pre-training, but it's a bit more, I would say, specialized than pre-training. It's the same algorithm, but you focus, for example, on long context; that's one example, you have long-context documents. The reason you don't do that during pure pre-training is that you don't have that many long-context documents, so you have a specific phase for it. And one problem of L_L_M_s is that it's still a neural network; it has the problem of catastrophic forgetting. You teach it something, it forgets other things. It's not a hundred percent forgetting, but there's no free lunch. It's the same with humans: if you ask me some math I learned ten years ago, I don't know, I would have to look at it again.
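To make the next-token cross-entropy objective from the pre-training definition above concrete, here's a minimal toy sketch. The tiny vocabulary and the "logits" table are invented; a real model produces logits with a transformer rather than a lookup, but the loss at each position is computed exactly this way:

```python
import math

# Minimal next-token prediction objective, the core of pre-training.
# A toy bigram "model": logits come from a hand-made table instead of a
# transformer, but the cross-entropy loss is computed the same way.
vocab = {"the": 0, "cat": 1, "sat": 2}
text = ["the", "cat", "sat"]
logits_table = {
    0: [0.1, 2.0, 0.3],  # after "the", the model favours "cat"
    1: [0.2, 0.1, 1.5],  # after "cat", the model favours "sat"
}

def cross_entropy(logits, target):
    # standard softmax cross-entropy for one position (numerically stable)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

loss = 0.0
for i in range(len(text) - 1):
    ctx, nxt = vocab[text[i]], vocab[text[i + 1]]
    loss += cross_entropy(logits_table[ctx], nxt)
loss /= len(text) - 1
print(round(loss, 3))
```

Pre-training is then just minimising this averaged loss over trillions of tokens, and the held-out version of the same quantity is the Y_ axis of the scaling laws discussed earlier.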
Uh Nathan was actually saying that he's consuming so much content that there is a catastrophic forgetting issue.
Yeah, I'm trying to learn so much about A_I_, it's like, I was learning about pre-training parallelism and I'm like, I lost something and I don't know what it was.
Mm-hmm. Mm-hmm.
pre-training, but I mean, I don't think anyone does that in production.
Toy examples for now, huh? But to generalise: post-training is more like the skill unlock, where pre-training is like soaking up the knowledge, essentially.
A few things that could be helpful for people: a lot of people think of synthetic data as being bad for training the models. You mentioned the DeepSeek O_C_R_, optical character recognition, paper. A lot of labs did this; A_I_ two had one, actually multiple. The reason each of these labs has these is that there are vast amounts of P_D_F_s and other digital documents on the web in formats where the text isn't easily extracted. So you use these O_C_R_ models, ours was called olmOCR, to extract what can be trillions of tokens of candidate data for pre-training. And pre-training data set size is measured in trillions of tokens: smaller models from researchers can be something like five to ten trillion, Qwen is documented going up to like fifty trillion, and there are rumors that the closed labs can go to like a hundred trillion tokens. Just getting this potential data to put in, they have a very big funnel, and then the data you actually train the model on is a small percentage of this. This character-recognition data would be described as synthetic data for pre-training in a lab. And then there are things like: Chat G_P_T_ now gives wonderful answers, and you can train on those best answers, and that's synthetic data too. It's very different from early Chat G_P_T_, with lots of hallucinations, which is when people's worries about synthetic data became grounded.
One interesting question: if I recall correctly, Olmo three was trained with less data than some other open-weight models, maybe even less than Olmo two, but you still have better performance, and that might be one example of how the data helps.
It's mostly down to data quality. If we had more compute we would train for longer; I think we'd ultimately see that as something we would want to do. And especially with big models you need more compute, because we talked about having more parameters and we talked about knowledge, and essentially there's a ratio where big models can absorb more from data, so you get more benefit out of this. Picture any logarithmic graph in your mind: a small model will level off sooner if you're measuring trillions of tokens, and bigger models need more. But the reality is we aren't training that big of models right now at A_I_ two, so getting the highest-quality data we can is the natural starting point.
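The parameter-to-data ratio Nathan mentions has a well-known back-of-the-envelope version from the scaling-law literature: roughly twenty training tokens per parameter for compute-optimal training, with training cost estimated at about six FLOPs per parameter per token. Both constants are approximations, and real labs deviate from them deliberately (for example, overtraining small models so they're cheap to serve):

```python
# Rough compute-optimal arithmetic in the Chinchilla style: ~20 training
# tokens per parameter, and ~6 * N * D FLOPs for one training pass.
# Both constants are approximations from the scaling-law literature.
params = 7e9                 # a 7B-parameter model
tokens = 20 * params         # ~140B tokens to train compute-optimally
flops = 6 * params * tokens  # ~5.9e21 FLOPs for the whole run
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

This is why a bigger model "needs more": doubling the parameter count roughly doubles the token budget too, so compute grows with the product of the two.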
Is there something to be said about the topic of data quality? Is there some low-hanging fruit there still, where the quality could be improved?
It's like turning the crank. Historically, in the open there's been a canonical best pre-training data set that has moved around based on who has the most recent or best effort. A_I_ two's Dolma was very early, with the first Olmo; Hugging Face had FineWeb; and there's the D_C_L_M_ project, which stands for DataComp language model; there's been DataComp for other machine learning projects, and they had a very strong data set. A lot of it is that the internet is becoming fairly closed off. We have Common Crawl, which I think is hundreds of trillions of tokens, and you filter it, and it ends up being a lot of scientific work where you're training classifiers and making decisions on how to prune this data set down into the highest-quality stuff and the stuff that suits your tasks. Previously language models were tested a lot more on knowledge and conversational things, but now they're expected to do math and code. So to train a reasoning model you need to remix your whole data set, and there are actually a lot of wonderful methods here, where you take your gigantic data set and sample a lot of really tiny subsets from different sources. Say you have GitHub, Stack Exchange, Reddit, Wikipedia: you can sample small things from them, train small models on each of these mixes, and measure their performance on your evaluations. Then you can do basic linear regression, and it's like, here's your optimal data set. But if your evaluations change, your data set changes a lot. So a lot of Olmo three was new sources for reasoning, to be better at math and code, and then you do this mixing procedure and it gives you the mix. A lot of that has happened at labs this year. There are new hot things, whether it's coding environments or web navigation, and you just need to bring in new data. You need to change your whole pre-training so that your post-training can work better, and things like this. That's the constant re-evolution and re-determining of what labs care about for their models.
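A toy sketch of that sample-small-mixes-then-regress procedure. The data sources, mixture weights, and eval scores below are all invented, and a real pipeline would train actual proxy models rather than read scores from a table; the point is only the shape of the method: fit a linear model from mixture weights to eval score, then score a mix you never trained:

```python
# Each trial: (weights over web/code/math data, observed proxy eval score).
trials = [
    ((0.8, 0.1, 0.1), 0.51),
    ((0.6, 0.3, 0.1), 0.58),
    ((0.6, 0.1, 0.3), 0.56),
    ((0.4, 0.3, 0.3), 0.63),
]

def fit_linear(trials, steps=20000, lr=0.1):
    # plain gradient descent on squared error; fine at this tiny scale
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for mix, score in trials:
            err = sum(wi * xi for wi, xi in zip(w, mix)) - score
            for j in range(3):
                grad[j] += 2 * err * mix[j]
        for j in range(3):
            w[j] -= lr * grad[j] / len(trials)
    return w

w = fit_linear(trials)         # per-source "value" estimates
candidate = (0.5, 0.25, 0.25)  # a mix we never actually trained
pred = sum(wi * xi for wi, xi in zip(w, candidate))
print([round(x, 2) for x in w], round(pred, 3))
```

The fitted weights act as a cheap per-source value estimate, so you can rank many candidate mixes without training a full model on each one, which is exactly why evaluation changes ripple back into the data mix.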
Are there fun anecdotes of what sources of data are particularly high quality that we wouldn't expect? You mentioned Reddit sometimes can be a source.
Reddit was very useful. And P_D_F_s are definitely one.
Or especially arXiv.
Yeah. A_I_ two has run Semantic Scholar for a long time, which is what you could call a competitor to Google Scholar with a lot more features. To do this, A_I_ two has found and scraped a lot of P_D_F_s of openly accessible papers that might not be behind the closed paywall of a certain publisher, so truly open scientific P_D_F_s. If you sit on all of these and process them, you can get value out of it. A lot of that style of work was done by the frontier labs much earlier, and you need a pretty skilled researcher who understands how things change models; they bring the data in and they clean it, and that's a lot of labour. I think at a lot of frontier labs, when they scale researchers, a lot of it goes into data. If you join a frontier lab and you wanna have impact, the best way to do it is just to find new data that's better. The fancy, glamorous algorithmic things, like figuring out how to make O_ one, are the sexiest thought of a researcher, and there's a group that did that, but I think most of the contributions are like, I'm gonna make the data better, or I'm gonna make the infrastructure better so that everybody on my team can run experiments five percent faster.
only licensed data, whereas Common Crawl is a scrape of the whole internet. I host multiple websites, and I'm happy to have them train language models, but I'm not explicitly licensing it, and what governs that? So Common Crawl is largely unlicensed, which means that your consent really hasn't been provided for how to use the data. There's another idea where you can train language models only on data that has been licensed explicitly, so that the governing contract is provided. I'm not sure if Apertus is the right name for the licensed one. I know that the reason they did it was for an E_U_ compliance thing, where they wanted to make sure that their model fit one of those checks.
Some said they just purchase the license. Let's say they buy a book online, an Amazon Kindle book or, say, a Manning book or something, and then use that in the training data. And that is the grey zone, because you paid for the content, and you might wanna train on it,
but then there are also restrictions where even that shouldn't be allowed, and that is where it gets a bit fuzzy. It is still a hot topic right now. Also, big companies like Open A_I_ have approached private companies for their proprietary data, and private companies have become more and more, let's say, protective of their data, because they know, okay, this is gonna be my moat in a few years. And I do think that's the interesting question: if L_L_M_s become more commoditized, and a lot of people learn about L_L_M_s, there will be a lot more people able to train L_L_M_s. Of course there are infrastructure challenges, but if you think of big industries, like the pharmaceutical industry, law, the finance industry, I do think they will at some point hire people from frontier labs to build their in-house models on their proprietary data, which will then again be another unlock with pre-training that is currently not there, because even if you wanted to, you can't get that data; you can't get access to clinical trials most of the time, and these types of things. So I do think scaling in that sense might still be pretty much alive if you also look at domain-specific applications, because right now, this year, we are just looking at general-purpose L_L_M_s, Chat G_P_T_, Anthropic, and so forth. They are general purpose. They're not even scratching the surface, I think, of what an L_L_M_ can do if it is really specifically trained and designed for a specific task.
I think on the data thing, this is one of those things that happened in twenty twenty-five that we totally forget: Anthropic lost in court and owed one point five billion dollars to authors. Anthropic, I think, bought thousands of books and scanned them, and was cleared legally for that because they bought the books, and that is kind of going through the system. On the other side, they also torrented some books, and I think the torrenting was the path where the court said they were liable to pay these billions of dollars to authors, which is just such a mind-boggling lawsuit that kind of just came and went. That is so much money from the V_C_ ecosystem.
These are court cases that will define the future of human civilization, 'cause it's clear that data drives a lot of this, and there's this very complicated human tension. I mean, you can empathize, you're both authors. There's some degree to which, I mean, you put your heart and soul and your sweat and tears into the writing that you do, and it feels a little bit like theft for somebody to train on your writing without giving you credit.
There are two layers to it. Someone might buy the book and then train on it, which could be argued fair or not fair. But then there are the straight-up companies who use pirated books, where it's not even compensating the author. That is, I think, where people got a bit angry about it specifically.
There has to be some kind of compensation scheme. This is moving towards something like what Spotify streaming did originally for music. What does that compensation look like? You have to define those kinds of models, you have to think through all of that. One other thing I think people are generally curious about, and I'd love to get your thoughts: as L_L_M_s are used more and more, if you look at even arXiv, but also GitHub, more and more of the data is generated by L_L_M_s. What do you do in that kind of world? How big of a problem is that?
Well there are just problems in infrastructure and systems, but from an A_I_ point of view it's kind of inevitable.
So it's basically L_L_M_ generated data that's curated by humans essentially, right?
Yes, and I think a lot of open source contributors are legitimately burning out. If you have a popular open source repo, somebody's like, oh, I wanna do open source A_I_, it's good for my career, and they just vibe-code something and throw it into the P_Rs. You might get more of this than I do. So I have
a case study here: I have a repository called mlxtend that I developed as a student around ten years ago. It is still a reasonably popular library for certain algorithms, I think especially the frequent pattern mining stuff. And recently there were, I think, two or three people who submitted a lot of P_Rs in a very short amount of time. I do think L_L_M_s have been involved in submitting these P_Rs. For me as the maintainer, there are two things. First, I'm a bit overwhelmed; I don't have time to
go through it, because it's an older library that is not a priority for me. At the same time, I also kind of appreciate it, because something people forget is that it's not just using the L_L_M_; there's still a human layer that verifies something. And that is, in a sense, also how data is labelled, right? One of the most expensive things is getting labelled data for R_L_ from human feedback phases. And this is kind of like that: it goes through phases, and then you actually get higher-quality data out of it. So I don't mind it, in a sense. It can feel overwhelming, but I do think there is also value in that.
It feels like there's a fundamental difference between raw L_L_M_-generated data and L_L_M_-generated data with a human in the loop who does some kind of verification, even if that verification covers a small percent of the lines of code.
I think this goes with anything. People sometimes think, oh, I can just use an L_L_M_ to learn about X_Y_Z_, which is true, you can. But there might be a person who is an expert, who might have used an L_L_M_ to write specific code, and there is this human work that went into it: making it nice, throwing out the not-so-nice parts, kind of pre-digesting it for you, and that saves you time. That's the value-add, where you have someone who knows things, or even knows how to use the L_L_M_s correctly. This is still labour that you get for free when you, for example, read an article, let's say a Substack article. I could maybe ask an L_L_M_ to give me opinions on it, but I wouldn't even know what to ask. I think there is still value in reading that article compared to me going to the L_L_M_, because you are the expert: you select what knowledge is actually spot-on and should be included, and you give me this executive summary. And this is kind of a huge value-add, because now I don't have to waste three, five hours going through this myself, maybe getting some incorrect information, and so on. So I think that's also where the future still is for writers: even with L_L_M_s around, an expert can save you time.
It's kinda fascinating to actually watch, and I'm sure you guys do this, but for me to look at the difference between the summary
and the original content. Even if it's a page-long summary of page-long content, it's interesting to see how the L_L_M_ summary takes the edge off, like what is the signal it removes from the thing.
The voice is what I talk about a lot.
Voice, I would love to hear what you mean by voice, that's really powerful. But sometimes there are literally insights, like in removing an insight you're actually fundamentally changing the meaning of the thing. So I'm continuously disappointed how bad L_L_M_s are at really getting to the core insights, which is what a great summary does. Even when I have these extensive, extremely elaborate prompts where I'm really trying to dig for them,
it's still not quite there. Which, I mean, that's a whole deep philosophical question about what is human knowledge and wisdom and what does it mean to be insightful and so on. But when you talk about the voice, what do you mean?
So when I write, a lot of what I'm trying to do is take what you think as a researcher, which is very raw: a researcher is trying to encapsulate an idea at the frontier of their understanding, trying to put what is a feeling into words. And I try to do this in the writing, which makes it come across as raw, but also high-information, in a way where some people will get it and some won't, and that's kind of the nature of research. And I think this is something that language models don't do well,
particularly because they're all trained with this reinforcement learning from human feedback, which is designed to take feedback from a lot of people and, in a way, average how the model behaves from it. And it's going to be hard for a model to be very incisive when there's that sort of filter in it. I think this is kind of a wonderful fundamental problem for researchers in R_L_H_F_: it provides so much utility in making the models better, but the problem formulation has this
averaging in it that you can't get past. So that's what I think of as these language models not having this prior, this deep expression that they're trying to get at. I don't think it's impossible to do. There are stories of models that really shocked people. Like, I would love to have tried Bing Sydney. Does that have more voice? 'Cause it would so often go off the rails on people, in what is, historically, obviously a scary way. Telling a reporter to leave his wife is a crazy model to potentially put into general
availability. But that's kind of the trade-off: is this R_L_H_F_ process in some ways adding limitations?
That's a terrifying place to be for one of these frontier labs and companies, because millions of people are using them.
There was a lot of backlash last year with G_P_T_ four O_ getting removed. I personally never used the model, but I've talked to people at OpenAI, and they're to the point where they get emails from users who might be detecting subtle differences in the deployments in the middle of the night, and they email them, like, my friend is different. They find these employees' emails and send them things because they are so attached to this thing. But it's a set of model weights and a configuration that is deployed to the users.
We see this with TikTok. You open it, and, I don't use TikTok, but supposedly in like five minutes the algorithm gets you. It's locked in. And those aren't language models doing recommendations. But I think there are ways that you could do this with a language model: within like five minutes of chatting with it, the model just gets you. And that is something that people aren't really ready for. Like, don't give that to kids, at least until we know what's happening.
Mm-hmm.
do is they will say, well, the suicide was committed because of the L_L_M_. And that's going to lead to the companies, because of legal issues and so on, more and more taking the edge off of the L_L_M_. So it's going to be as generic as possible. It's so difficult to operate in this space, because of course you don't want an L_L_M_ to cause harm to humans at that level. But this is also the nature of the human experience: to have a rich conversation, a fulfilling
conversation, one that challenges you and from which you grow, you need that edge. And that's something that's extremely difficult for A_I_ researchers on the R_L_H_F_ front to actually solve, 'cause you're actually dealing with the human condition.
A lot of researchers at these companies are so well motivated, and places like Anthropic and OpenAI culturally so want to do good for the world through this. And it's such a hard thing; I'm like, I don't wanna work on this, because on the one hand a lot of people see A_I_ as a health ally, as somebody they can talk to about their health confidentially, but then it bleeds all the way into talking about mental health and things, where it's possible
that this will be the thing where somebody goes over the edge, but other people might be saved. There are things that, as a researcher training models, I don't wanna do. I don't wanna train image generation models and release them openly, 'cause I don't wanna enable somebody to have a tool on their laptop that can harm other people; I don't have the infrastructure at my company to do that safely. But there are a lot of areas like this where it just needs people that will approach it with complexity and conviction, because it's just
a hard problem.
But also we as a society, as users of these technologies, need to make sure that we're having the complicated conversation about it, versus just fear-mongering that big tech is causing harm to humans or stealing your data, all that kind of stuff. It is more complicated than that, and you're right: there's a very large number of people inside these companies, many of whom you know, many of whom I know, that deeply care about helping people. They are considering the full human experience of people from across the world, not just Silicon Valley, people across the United States, people
across the world, what that means, what their needs are. It's really difficult to design this one system that is able to help all these different kinds of people across different age groups, cultures, mental states, mental conditions, all that kind of stuff.
I wish that the timing of A_I_ was different with respect to the relationship of big tech to the average person. Big tech's reputation was already so low, and with how expensive A_I_ is, it's inevitably gonna be a big tech thing, because it takes so many resources. People say that the U_S_ is quote unquote betting the economy on A_I_ with this build-out. To have these be intertwined at the same time just makes for such a hard communication environment. It would be good for me to go talk to more people in the world that hate big tech and see A_I_ as a
continuation of this.
And one of the things you actually recommend, one of the antidotes that you talk about, is to find agency in this whole system, as opposed to sitting back in a powerless way and consuming the A_I_ slop as it rapidly takes over the internet. Find agency by using it to build stuff, build apps. One, that actually helps you build the intuition, but two, it's empowering, because you can understand how it
works, what the weaknesses are, and it gives your voice power to say, this is fucked up, this is a bad use of the technology, and this is a good use of the technology. You're more plugged into the system, so you can understand it better and you can steer it better.
It is a good point you brought up, agency. Instead of ignoring it and saying, okay, I'm not gonna use it, I think it's probably long-term healthier to say, okay, it's out there, I can't put it back, you know, like the internet and computers back when they came out. How do I make best use of it, and how does it help me to up-level myself? The one thing I worry about here, though, is if you fully use it for something you love to do, the thing you love to do is no longer there, and I feel like that could potentially lead to burn-out. For example, if I use
L_L_M_s to do all my coding for me, now there is no coding; I'm just managing something that is coding for me. Let's say two years later, if I just do that eight hours a day, have something code for me, do I still feel fulfilled? Is this hurting me in terms of being excited about my job, excited about what I'm doing? Am I still proud to build something?
So on that topic of enjoyment, it's quite interesting, we should just throw this in there, that there is this recent survey of about seven hundred and ninety-one professional developers, professional meaning ten-plus years of experience.
That's a long time.
Yeah, in this day and age. So the results here are surprising on many fronts. They break it down by junior and senior developers, but it shows that both junior and senior developers use A_I_-generated code in code they ship. So this is not just for fun, sort of intermediate learning things. This is code they
ship. And it's twenty-five percent at minimum: most of them use around fifty percent or more. And what's interesting is, for the category of over fifty percent of the code you ship being A_I_-generated, senior developers are much more likely to do so. But you don't want A_I_ to take away the thing you love. These particular results I'm about to mention speak to my experience: together, about eighty percent of people find it either somewhat more enjoyable or significantly more enjoyable
to use A_I_ as part of the work?
I think it depends on the task, from my personal usage, for example. I have a website where I sometimes tweak things. I personally don't enjoy this. So in that sense, if the A_I_ can help me implement something on my website, I'm all here for it. It's great. But at the same time, when I solve a complex problem, well, if there's a bug and I hunt this bug and I find the bug, it's the best feeling in the world. You get so much joy, you feel
great. But now if you don't even think about the bug, you just go directly to the L_L_M_, well, you never have this kind of feeling, right? But then there could be the middle ground where you try yourself, you can't find it, you use the L_L_M_, and then you don't get frustrated because it helps you, and you move on to something that you enjoy. And so looking at these statistics, I think what is not factored in is that it's averaging over all the different scenarios, so we don't know if it's for the core task or if it's
for something mundane that people would not have enjoyed otherwise. So in a sense, A_I_ is really great for doing mundane things that take a lot of work. For example, my wife the other day, she has a podcast for book discussions, a book club, and she was transferring the show notes from Spotify to YouTube, and the links somehow broke. In some episodes, because they discuss many books, she had like a hundred links or something, and it would have been really painful to go in there and fix each link
manually. And so I suggested, hey, let's try Chat G_P_T_. We copied the text into Chat G_P_T_ and it fixed them. Instead of two hours going from link to link fixing that, it made that type of work much more seamless; there was no frustration. I think everyone has a use case where A_I_ is useful for something like that, something that would be really boring, really mundane.
For me personally, since we're talking about coding, and you mentioned debugging, a lot of the source of enjoyment for me, more on the Cursor side than the Claude Code side, is that I have a friend, a, what's that called, a pair programmer. It's less lonely. You made debugging sound like this great joy. No, I would say debugging is like a drink of water after you've been going through a
desert for days. So you skip the whole desert part where you're suffering. Sometimes it's nice to have a friend who can't really find the bug either but can give you some intuition about the code, and you're going through the desert together with that friend and then together find that drink of water. So at least for me, it maybe speaks to the loneliness of the programming experience. That is a source of joy.
Mm-hmm.
Mm-hmm.
there, if you can solve it, then it's great. But there's also a sweet Goldilocks zone, because if it's too hard, then it's, you know, wasting your time. But I think that is another challenge, though: how will people learn? I mean, in the chart we looked at, we saw that more senior developers are shipping more A_I_-generated code than the junior ones, and I think it's very interesting, because intuitively you would think it's the junior developers, because they don't know, let's say, how to do the thing yet, because they are more junior, and so they use A_I_
to do that thing. It could either mean the A_I_ is not good enough yet to solve that task, or it could mean experts are more effective at using it: they know better where and how to use it, they review the code, and they trust the code more then. And so I think one issue in society in the future will be, how do you become an expert if you never try to do the thing yourself? One way, it's always like how I learn, is by trying things myself. Like math textbooks: if you
look at the solutions, yeah, you learned something, but I think you actually learn better if you try first, and then you appreciate the solution differently, because you know how to put it into your mental framework. And if L_L_M_s are here all the time, would you actually go through the lengths of struggling? Would you be willing to struggle? Because struggle is not nice, right? And if you use the L_L_M_ to do everything, at some point you will never really take the next step. And then you will maybe not get that unlock that you
get as an expert using an L_L_M_. So I think there's a Goldilocks sweet spot, and maybe the trick here is you make dedicated offline time where you study two hours a day, and the rest of the day you use L_L_M_s. But I think it's important for people to still invest in themselves, in my opinion, to not just, you know, L_L_M_ everything.
Yeah, and we together as a civilization, and each of us individually, have to find that Goldilocks zone, in the programming context as developers. Now, we had this fascinating conversation that started with pre-training and mid-training. Let's get to post-training. A lot of fun stuff in post-training. So what are some of the interesting ideas in post-training?
Mm-hmm.
A lot of this is kind of an iterative generate-and-grade loop, and that lets the models learn interesting behaviours on both the tool use and software side. This could be searching, running commands on their own and seeing outputs. And that training also enables this inference-time scaling very nicely. It just turned out that this paradigm was very nicely linked, where this kind of R_L_ training enables inference-time scaling, but inference-time scaling could have been found in different ways. So it was kind of this perfect storm: the models changed a lot, and the
way that they're trained is a major factor in doing so, and this has changed how people approach post-training dramatically.
Can you describe R_L_V_R_, popularized by DeepSeek R_ one? Can you describe how it works?
Yeah, fun fact, I was on the team that came up with the term R_L_V_R_, which is from our Tülu 3 work before DeepSeek. We don't take a lot of credit for being the people to popularize scaling R_L_, but as an aside, it is fun that what academics get is the ability to name things and influence the discourse, because the closed labs can only say so much. One of the things you can do as an academic is, you might not have the compute to train the model, but you can frame things,
and a community can come together around this R_L_V_R_ term, which is very fun. And then DeepSeek is the people that did the training breakthrough, which is they scaled the reinforcement learning: you'd have the model generate answers, then grade the completion on whether it was right, and that accuracy is your reward for reinforcement learning. Reinforcement learning is classically an agent that acts in an environment; the environment gives it a state and a reward back, and you try to maximise that
reward. In the case of language models, the reward is normally accuracy on a set of verifiable tasks, whether it's math problems or coding tasks. It starts to get blurry with things like factual domains, which are also in some ways verifiable, or constraints on your instruction, like respond only with words that start with A_. All of these things are verifiable in some way, and the core idea is you find a lot more of these problems that are
verifiable and you let the model try them many times while taking these R_L_ steps, these R_L_ gradient updates. The infrastructure evolved from reinforcement learning from human feedback, where in that era the score they were trying to optimise was a learned reward model of aggregate human preferences. So you change the problem domains, and that let the optimisation go on to much bigger scales, which kick-started a major change in what the models can do and how people use them.
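The grading step described here can be sketched as a couple of toy verifier functions. The `Answer:` extraction format and the words-start-with-A constraint are illustrative assumptions for this sketch, not any lab's actual setup:

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0.

    Assumes the model is prompted to end with 'Answer: <value>'; the exact
    answer format is a per-setup choice, not a standard.
    """
    match = re.search(r"Answer:\s*(.+)\s*$", completion.strip())
    if match is None:
        return 0.0  # unparseable completions get no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def constraint_reward(completion: str) -> float:
    """A non-math verifiable reward: every word must start with 'a'."""
    words = completion.split()
    if not words:
        return 0.0
    return 1.0 if all(w.lower().lstrip('"\'(').startswith("a") for w in words) else 0.0
```

The reward is all-or-nothing here; real setups often mix several such checks (correctness, formatting, language consistency) into one scalar.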
What kind of domains is R_L_V_R_ amenable to?
Math and code are the famous ones, and then there's a lot of work on what are called rubrics, which is related to a phrase people might have heard, L_L_M_-as-a-judge. For each problem in my training data set, I'll have another language model and ask it, what would a good answer to this problem look like? And then you could try the problem a bunch of times over and over again and assign a score based on this rubric. So that's not necessarily verifiable like a math or code domain, but this rubrics approach,
for other scientific problems where things might be a little bit more vague, is where a lot of the attention is. They're trying to push this set of methods into these more open-ended domains so the models can learn a lot more.
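The rubric setup described above can be sketched as a small scoring loop. The prompt template and the `judge` callable are placeholders for what would really be an L_L_M_ API call, one yes/no query per criterion, so treat this as the shape of the method rather than a real grader:

```python
from typing import Callable

def rubric_score(problem: str, completion: str,
                 rubric: list[str],
                 judge: Callable[[str], bool]) -> float:
    """Fraction of rubric criteria the judge says the completion satisfies.

    `judge` stands in for an L_L_M_-as-a-judge call; here it is just a
    callable returning True/False so the sketch stays self-contained.
    """
    if not rubric:
        return 0.0
    prompts = [f"Problem: {problem}\nAnswer: {completion}\nCriterion: {c}"
               for c in rubric]
    return sum(judge(p) for p in prompts) / len(rubric)
```

The resulting fraction can then be used directly as the R_L_ reward, in place of the binary correct/incorrect signal of math and code domains.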
I think that's called reinforcement learning from A_I_ feedback, right?
That's the older term for it, which was coined in Anthropic's Constitutional A_I_ paper. A lot of these things come in cycles.
Also, just one step back on R_L_V_R_. I think the interesting, beautiful thing here is that you ask the L_L_M_, let's say, a math question, and you know the correct answer. And you let the L_L_M_, like you said, figure it out. But how it does that, you don't really constrain much. There are some constraints you can add, like use the same language, don't switch between Spanish and English. But let's say you're pretty much hands off: you only give the question and the answer, and then the L_L_M_ has the task to arrive at the
right answer. The beautiful thing here is that what happens in practice is that the L_L_M_ will do a step-by-step description, like how a student or a mathematician would derive the solution. It will use those steps, and that actually helps the model to improve its own accuracy. And then, like you said, the inference scaling. Inference scaling loosely means spending more compute when using the L_L_M_ during inference. And here the inference scaling is that the model
would use more tokens. And also, I think in the R_ one paper they showed the longer they train the model, the longer the responses are. They grow over time; they use more tokens, so it becomes more expensive, even for simple tasks. But these explanations help the model with the accuracy. There are also a lot of interesting papers showing that what the model explains does not necessarily have to be correct, or maybe it's even unrelated to the answer, but for some reason it still helps the model; it's the fact that it is explaining at all. And again, I
don't wanna anthropomorphize these L_L_M_s, but it's kinda like how we humans operate, right? If there's a complex math problem, let's say in a math class, you usually have note paper and you do it step by step, you cross out things. And the model also self-corrects, and that was, I think, the aha moment in the R_ one paper. They called it the aha moment because the model itself recognised it made a mistake and then said, ah, I did something wrong, so let me retry. I think it's just so cool that this falls out of just giving it the correct answer and having it figure
out how to do it, that it kind of does in a sense what a human would do, although L_L_M_s don't think like humans; it's kind of an interesting coincidence. And a nice side effect is that it's great for us humans to see these steps: it builds trust, but we also learn, and we can double-check things.
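For concreteness: DeepSeek R_ one's training used a group-relative policy-gradient method (G_R_P_O_), which turns these binary correct/incorrect rewards into a learning signal by normalising within a group of attempts at the same question. This is a minimal sketch of just that advantage step, with everything else (sampling, the policy update itself) omitted:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalise per-attempt rewards within one question's group of samples.

    With binary verifiable rewards, correct attempts get a positive advantage
    and incorrect ones a negative advantage. If every attempt agrees (all 0s
    or all 1s), there is no contrast and no learning signal for this question.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # no contrast within the group
    return [(r - mean) / std for r in rewards]
```

Questions the model always gets right (or always gets wrong) contribute nothing, which is one reason difficulty filtering of the training problems matters so much in these pipelines.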
There's a lot in here. There's been a lot of debate this year on whether the aha moments in these language models are kind of fake, because in pre-training you have essentially seen the whole internet, so you have definitely seen people explaining their work, even verbally, like a transcript of a math lecture: you try this, oh, I messed this up. What reinforcement learning, this R_L_V_R_, is very good at doing is amplifying these behaviours, 'cause they're very useful in enabling the model to think longer and to check its work. And I agree that it is very
Mm-hmm.
striking that with this training, the model learns to amplify this in a way that is just so useful for making the final answers better.
I can give you a hands-on example. I was training the Qwen three base model with R_L_V_R_ on MATH five hundred. The base model had an accuracy of about fifteen percent. In just fifty steps, like a few minutes, with R_L_V_R_ the model went from fifteen percent to fifty percent accuracy. You can't tell me it's learning anything fundamentally about math in so few steps.
There have been two papers this year, one of which I was on, that talk about data contamination in Qwen, and specifically that they train, in this special mid-training phase that we just spent a minute on, and it's weird, on problems that are almost identical to MATH.
Mm-hmm.
Mm-hmm.
There have been multiple papers talking about contamination, so how much can you believe the results? And I think this is what caused the reputation of R_L_V_R_ being about formatting: because you can get these gains so quickly, the capability must already be in the model. But there's a lot of complexity here, and it's not really controlled experimentation, so you don't really know.
If it weren't true, I would say distillation wouldn't work, right? I mean, distillation can work to some extent. But the thing is, and I think this is the biggest problem, it's hard to research this contamination, because we don't know what's in the data; unless you have a brand-new data set, it's really impossible. And the same, you mentioned the MATH data set, where you have a question, an answer, and a given explanation. But also, even for something simpler like M_M_L_U_, which is a multiple-choice benchmark, if you just change the format slightly, like,
I don't know, you use a dot instead of a parenthesis or something like that, the model accuracy will vastly differ.
I think that could be a model issue rather than a general issue.
It's not even malicious by the developers of the L_L_M_, like, hey, we wanna cheat at that benchmark. It's just that it has seen something at some point. And I think the only fair way to evaluate an L_L_M_ is to have a new benchmark created after the cut-off date, when the L_L_M_ was already deployed.
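The dot-versus-parenthesis point above can be made concrete with a tiny formatter that renders the same multiple-choice question two ways. These templates are illustrative, not the official M_M_L_U_ prompt; the point is only that evaluations differing in this surface marker can yield noticeably different accuracies for the same model:

```python
def format_mcq(question: str, choices: list[str], style: str = "dot") -> str:
    """Render one multiple-choice question in two surface formats.

    style='dot'   -> 'A. <choice>'
    style='paren' -> 'A) <choice>'
    """
    letters = "ABCD"
    sep = ". " if style == "dot" else ") "
    lines = [question] + [f"{letters[i]}{sep}{c}" for i, c in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"
```

Running a benchmark with both styles and comparing accuracies is a cheap sanity check for this kind of format sensitivity.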
Can we lay out what would be the recipe of all those things that go into post-training? You mentioned R_L_V_R_ as a really exciting, effective thing; maybe we should elaborate that R_L_H_F_ still has a really important role to play. What other ideas are there in post-training?
Mm-hmm.
of all forming together. But to summarise: mid-training is, give the model the skills it needs to then learn. R_L_ with verifiable rewards is, let the model try a lot of times, so put a lot of compute into trial-and-error learning across hard problems. And then R_L_H_F_ would be, finish the model, make it easy to use, and kind of round the model out.
Can you comment on the amount of compute required for R_L_V_R_?
It's only gone up and up. I think Grok four was famous for saying they used a similar amount of compute for pre-training and post-training. Back to the scaling discussion: they involve very different hardware for scaling. Pre-training is very compute-bound, which is this FLOPS discussion, just how many matrix multiplications can you get through in a given time. And because in R_L_ you're generating these answers, you're trying the model in real-world environments, it ends up being much more memory-bound, because you're generating long sequences and the attention mechanisms have this
behaviour where you get a quadratic increase in memory as you get to longer sequences. So the compute becomes very different. In pre-training we would talk about a model, if we go back to the Biden administration executive order, it's like ten to the twenty-fifth FLOPS to train a model. If you're using FLOPS in post-training, it's a lot weirder, because the reality is just, how many hours are you allocating how many G_P_U_s for? And I think in terms of time, the R_L_ compute is getting much closer, because you just can't pack it
into one system. Pre-training is so computationally dense, where all the G_P_U_s are talking to each other and it's extremely efficient, where R_L_ has all these moving parts, and it can just take a long time to generate a sequence of a hundred thousand tokens. If you think about G_P_T_ five point two pro taking an hour, it's like, what if your training run has a sample that takes an hour, and you have to make it so that's handled efficiently? So in G_P_U_ hours, or just wall-clock hours, the R_L_ runs are probably approaching the number of days of pre-training, but they probably aren't using as many
G_P_U_s at the same time. There are rules of thumb in labs where you don't want your pre-training runs to last more than about a month, because they fail catastrophically. And if you were planning a huge cluster to be held for two months and then it fails on day fifty, the opportunity cost is just so big. So people don't wanna put all their eggs in one basket. G_P_T_ four was the ultimate YOLO run, and nobody ever wanted to do it before; it took like three months to train, and everybody was shocked that it worked. I think people are a little bit more
incremental now.
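The quadratic-memory point above can be sketched with a back-of-the-envelope function. The head count and two-byte precision are illustrative assumptions, and real inference stacks avoid materialising this matrix (for example, FlashAttention recomputes it in tiles), but the scaling intuition is the point:

```python
def attention_score_bytes(seq_len: int, n_heads: int = 32,
                          bytes_per_elem: int = 2) -> int:
    """Memory for one layer's full attention-score matrix, if materialised.

    The matrix is (n_heads x seq_len x seq_len), so doubling the sequence
    length quadruples this term: the quadratic behaviour described above.
    """
    return n_heads * seq_len * seq_len * bytes_per_elem
```

At a hundred thousand tokens this single term reaches hundreds of gigabytes per layer, which is why long R_L_ rollouts stress memory rather than raw FLOPS.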
So with R_L_V_R_, it's, let's say, more unlimited how much you can train and still get benefit, whereas with R_L_H_F_, because it's preference tuning, you reach a certain point where it doesn't really make sense to spend more R_L_ budget on it. Just to step back on preference tuning: multiple people can give multiple explanations for the same thing, and they can both be correct, but at some point you've learned a certain style and it doesn't make sense to, you know, keep iterating on it. My favourite example is if relatives ask
me what laptop they should buy. I give them an explanation, ask them, what is your use case? They, for example, prioritize battery life and storage. Other people, like us, would prioritize RAM and compute. Both answers are correct, but different people require different answers. And with preference tuning, you're trying to average somehow: you're asking the data labelers to give you, well, not the right answer but the preferred answer, and then you train on that. At some point you've learned that average preferred answer,
and there's no reason, I think, to keep training longer on it, because it's just a style. With R_L_V_R_, you literally let the model solve more and more difficult problems, and so I think it makes more sense to allocate more budget long-term to R_L_V_R_. Also, right now we are in R_L_V_R_ one point O_ land, where it's still that simple thing where we have a question and an answer, but we don't do anything with the part
in between. There were multiple research papers, also by Google, for example, on process reward models that also give scores for the explanation: how correct is the explanation? And I think that will be the next thing, let's say R_L_V_R_ two point O_, for this year: focusing in between question and answer, how to leverage that information, the explanation, to improve the explanation and help it get better accuracy. So that's one angle. And there was a DeepSeek Math
two paper where they also had interesting inference scaling: they developed models that grade themselves with a separate model. I think that will be one aspect, and the other, like Nathan mentioned, will be that for R_L_V_R_ we are branching into other domains.
the place where people are excited are value functions, which is ver pretty similar. So process reward models are kind of like process reward models assign how good something is to each kind of intermediate step in a reasoning process where value functions apply value to every token the language model generates. Both of these have been largely unproven in the language modelling in this reasoning model era. People are more optimistic about value functions forever for whatever reason now. I think process
reward models were tried a lot more in this pre-O_ one, pre-reasoning-model era, and a lot of people had a lot of headaches with them. So I think a lot of it is human nature: value models have a very deep history in reinforcement learning. They're one of the things that were core to deep reinforcement learning existing, like training value models. So right now in the literature people are excited about trying value models, but there's very little proof in it. And there are negative examples in trying to scale up process reward models. These things don't always hold in the
Mm-hmm.
Mm-hmm.
plot like this. But there's no scaling law for R_L_H_F_ where, if you increase the compute, you get some performance. In fact, the seminal scaling paper for R_L_H_F_ is scaling laws for reward model over-optimization. So that's a big line to draw with R_L_V_R_: the methods we have now and in the future will follow the scaling paradigm, which is like the best runs you can let run for an extra ten X_ and you get a few X_ performance, but you can't do this with R_L_H_F_. And that is just gonna be field defining and
people approach them, where I'm a shill for people academically to do R_L_H_F_. And that's a good way to describe it: to do the best R_L_H_F_ you might not need the extra ten or a hundred X_ compute, but to do the best R_L_V_R_ you do. So there's what I'd say is a seminal paper from what was a Meta internship, it's called The Art of Scaling Reinforcement Learning with language models. What they describe as a framework is Scale R_L_, and their incremental experiment was like ten
and could be two hundred hours, which is like thousands or tens of thousands of dollars per experiment, and they do a lot of them, and this cost is just not accessible to the average academic, which is a hard equilibrium, where it's trying to figure out how each community can learn from the other.
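As a rough sanity check on those numbers, the arithmetic can be sketched as follows; the hourly rate is an assumed illustrative cloud price, not a quote:

```python
# Back-of-envelope cost of one R_L_ experiment: G_P_U_ count times
# wall-clock hours times an assumed dollars-per-G_P_U_-hour rate.

def experiment_cost(gpu_count: int, hours: float, dollars_per_gpu_hour: float) -> float:
    return gpu_count * hours * dollars_per_gpu_hour

# e.g. an 8-G_P_U_ node for two hundred hours at an assumed $2/G_P_U_-hour:
cost = experiment_cost(8, 200, 2.0)  # 3200.0 dollars
```

Scale to the multi-node setups used in these papers and a single sweep easily lands in the tens of thousands of dollars.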
I was wondering if we could take a bit of a tangent at this point and talk about education and learning. If you're somebody listening to this who's a smart person interested in programming, interested in A_I_, I presume building something from scratch is a good beginning. So can you just take me through what you would recommend people do?
I would personally start, like you said, implementing a simple model from scratch that you can run on your computer. The goal, if you build a model from scratch, is not to have something you use every day for your personal projects. It's not gonna be your personal assistant replacing an existing open-weight model or ChatGPT. It's to see what exactly goes into the L_L_M_, what exactly comes out of the L_L_M_, how the pre-training works, in that sense, on your own computer preferably. And then if you learn about the pre-training, the supervised
tuning, the attention mechanism, you get a solid understanding of how things work. But at some point you will reach a limit, because small models can only do so much. And the problem with learning about L_L_M_s at scale is, I would say, it's exponentially more complex to make a larger model, because it's not just that the model becomes larger. You now have to think about sharding your parameters across multiple G_P_Us. Even for the K_V_ cache, there are multiple ways you can implement it. One is just to understand how it works: just grow the cache. That's
a cache you grow step by step by, let's say, concatenating lists, growing it. But then it wouldn't be optimal on G_P_U_s; you wouldn't do that. You would pre-allocate a tensor and then fill it in. But that adds another twenty or thirty lines of code, and for each thing you add so much code. And I think the trick with a book is basically to understand how the L_L_M_ works. It's not gonna be your production-level L_L_M_, but once you have that, you can understand the production-level L_L_M_.
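The two K_V_-cache strategies just described can be sketched in toy Python, with plain lists standing in for tensors; a real implementation would pre-allocate G_P_U_ tensors, and the class names here are made up for illustration:

```python
# Variant 1: grow the cache step by step (easiest to understand, but slow
# on G_P_U_s because of repeated reallocation). Variant 2: pre-allocate up
# to a maximum length and fill in position by position, which is what real
# implementations do.

class GrowingKVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)      # concatenate-style growth each step
        self.values.append(v)

class PreallocatedKVCache:
    def __init__(self, max_len):
        self.keys = [None] * max_len    # allocated once, up front
        self.values = [None] * max_len
        self.pos = 0

    def append(self, k, v):
        self.keys[self.pos] = k         # fill in place, no reallocation
        self.values[self.pos] = v
        self.pos += 1

cache = PreallocatedKVCache(max_len=8)
cache.append([0.1, 0.2], [0.3, 0.4])    # one decoding step
```

Both store the same per-step keys and values; the pre-allocated version is the extra twenty or thirty lines that buys you G_P_U_ efficiency.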
So you're trying to always build an L_L_M_ that's gonna fit on one G_P_U_?
Yes. Most of them, I have some bonus materials on some M_O_E_ models, I think one or two of them may require multiple G_P_Us, but the goal is to have it on one G_P_U_. And the beautiful thing is also you can self-verify. It's almost like R_L_V_R_ when you code these from scratch. You can take an existing model from the Hugging Face Transformers library. So the Hugging Face Transformers library is great, but if you wanna learn about L_L_M_s, I think that's not the best place to start, because the code is so complex, because it has so
it has to fit so many use cases. Also some people use it in production. It has to be really sophisticated and it's really intertwined and really hard. It's not linear to read.
It started as a fine-tuning library, and then it grew to be the standard representation of every model architecture and the way it is loaded. So Hugging Face is like the default place to get a model, and Transformers is the software that enables it, so people can easily load a model and do something basic with it.
And all frontier labs that have open-weight models have a Hugging Face Transformers version of it, from DeepSeek to G_P_T_ O_S_S_. That's the canonical weight that you can load there. But again, even Transformers, the library, is not used in production. People then use S_G_Lang or V_L_L_M_, and that adds another layer of complexity.
We should say that the transformers library has like four hundred models.
So it's one library that tries to implement a lot of L_L_M_s, and so you have a huge code base, basically. It's huge, it's, I dunno, maybe hundreds of thousands of lines of code, and understanding the part that you wanna understand is finding the needle in the haystack. What's beautiful about it is you have a working implementation, and so you can work backwards from it. What I recommend doing, or what I also do, is if I wanna understand, for example, how Olmo three is implemented, I would look
at the weights in the model hub, the config file, and then you can see, oh, they use so many layers, they use, let's say, grouped-query attention or multi-head attention in that case. Then you see all the components in a human-readable, I dunno, hundred-line config file, and then you start, let's say, with your G_P_T_ two model and add these things, you know. And the cool thing here is you can then load the pre-trained weights and see if they work in your model. And you wanna match the same output that you get with the Transformers model, and then you can use that basically as a verifiable
way to make your architecture correct. And sometimes it takes me a day. With Olmo three, the challenge was RoPE for the position embeddings. They had a YaRN extension, and there was some custom scaling there, and I couldn't quite match these things. And in this struggle you kind of understand things. But the cool thing is, at the end you know you have it correct, because you can unit test it, you can check against the reference implementation. And I think that's maybe one of the best ways to really learn:
basically reverse engineer something. Yep.
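That verification loop, checking your from-scratch code against a reference, looks roughly like this; the two "models" below are trivial stand-ins, since in practice you would compare logits from your implementation against the Hugging Face Transformers version:

```python
# Unit-test pattern for a from-scratch reimplementation: feed the same
# input to both implementations and require the outputs to match within
# a tolerance. The "models" here are stand-in functions for illustration.

def reference_model(x):
    return [2.0 * v + 1.0 for v in x]

def my_from_scratch_model(x):
    out = []
    for v in x:
        out.append(v * 2.0 + 1.0)  # same computation, written independently
    return out

def outputs_match(a, b, tol=1e-6):
    return len(a) == len(b) and all(abs(p - q) <= tol for p, q in zip(a, b))

x = [0.5, -1.0, 3.0]
assert outputs_match(my_from_scratch_model(x), reference_model(x))
```

When something like a RoPE scaling detail is wrong, this check fails, and the mismatch tells you exactly where to keep digging.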
I think that is something that everybody that's interested in getting into A_I_ today should do. And I think that's why I liked your book. I came to language models from this R_L_ and robotics field. I'd never taken the time to just learn all the fundamentals, and this transformer architecture I described as being so fundamental, like deep learning was a thing that I had to learn in the past, and people need to do this. And I think where a lot of people get overwhelmed is how do I apply this to have
impact, or find a career path, because A_I_ and language models make this fundamental stuff so accessible, and people with motivation can learn it. And then it's like, how do I get the cycles in to contribute to research? And I'm actually fairly optimistic on this, because the field moves so fast that a lot of times the best people don't fully solve a problem, because there's a bigger problem to solve that's very low-hanging fruit, so they move on. And I think a lot of what I was trying to do in the
R_L_H_F_ book is take post-training techniques and just describe how people think about them influencing the model and what people are doing, and it's remarkable how many things people just stopped studying, or never did. So I think people trying to get narrow after doing the fundamentals is good, and then reading the relevant papers and being engaged in the ecosystem. The proximity that random people online have to the leading researchers
is remarkable. Anonymous accounts on X_ in M_L_ are very popular for whatever reason, and no one knows who all these people are. It could just be random people that study the stuff deeply. Especially with the A_I_ tools, just be like, I don't understand this, keep digging into it. I think it's a very useful thing. But there are a lot of research areas where there are maybe three papers that you need to read. And then one of the authors will probably email you back. But you have to put a lot of effort into these emails to understand the field. Like I think it would be
for a newcomer easily weeks of work to feel like they can truly grasp what is a very narrow area. But I think going narrow after you have the fundamentals would be very useful to people. Like, I became very interested in character training, which is how you make the model funny or sarcastic or serious, and what do you do to the data to do this. And a student at Oxford reached out to me, like, hey, I'm interested in this, and I advised him, and now that paper exists. And it's like, I don't
know, there's like two or three people in the world that were very interested in this. He's a P_H_D_ student, which gives you an advantage, but for me that was a topic where I was waiting for someone to be like, hey, I have time to spend cycles on this. And I'm sure there are a lot more very narrow things where you're just like, oh, it doesn't make sense that there was no answer to this. And there's just so much information coming that people are like, I can't grab onto any of these, but if you actually stick with an area, I think there are a lot of interesting things to learn.
Yeah, I think you can't try to do it all, because it would be very overwhelming and you would burn out if you tried to keep up with everything. For me, for example, I haven't kept up with computer vision in a long time, just focused on L_L_M_s. But coming back to your book, for example, I think this is also a really great book and really good bang for the buck, because if you wanna learn about R_L_H_F_, I wouldn't go out there and read R_L_H_F_ papers, because you would be spending two years. Yeah.
And we'll see what comes out to be true.
What are some of the, just to go through some of the table of contents, some of the ideas we might have missed in the bigger picture of post-training. So first of all you do the problem setup, training overview, what are preferences, preference data, and the optimisation tools: reward modelling, regularisation, instruction tuning, rejection sampling, reinforcement learning, i.e. policy gradients, direct alignment algorithms, then constitutional A_I_ and A_I_ feedback, reasoning and inference-time scaling, tool use and function calling, synthetic data and distillation,
evaluation, and then an open questions section: over-optimization, style and information, and then product, U_X_, character, and post-training. So what are some ideas worth mentioning that connect both the educational component and the research component? You mentioned character training, which is pretty interesting.
Character training is interesting 'cause there's so little out there about it. We talk about how people engage with these models, and look, we feel good using them 'cause they're positive, but that can go too far, it could be too positive. Essentially it's: how do you change your data and/or decision making to make it exactly what you want? And open A_I_ has this thing called a model spec, which is essentially their internal guideline for what they want the model to do, and they publish this to developers, so essentially you can know what is a
shortcoming of open A_I_'s training, which is like they have the intentions and they haven't met them yet, versus what is something that they actually wanted to do and that you don't like. And that transparency is very nice, but all the methods for curating these documents and how easy it is to follow them is not very well known. I think the way the book is designed is that the reinforcement learning chapter is obviously what people want, because everybody hears about it with R_L_V_R_. And it's the same algorithm, it's the same math, but you can use it in very different domains. So I think the core problem of R_L_H_F_,
how messy preferences are, is essentially a rehash of a paper I wrote years ago. But this is essentially the chapter that'll tell you why R_L_H_F_ is never fully solvable, because the way that even the R_L_ is set up assumes that preferences can be quantified and that multiple preferences can be reduced to single values. And I think it relates, in the economics literature, to the von Neumann-Morgenstern utility theorem.
And that is the chapter with all of the philosophical, economic, and psychological context; it tells you what gets compressed into doing R_L_H_F_. So you have all of this, and then later in the book you use this R_L_ math to make the number go up. And that's why I think it would be very rewarding for people to do research on, because quantifying preferences is something where humans have designed the problem in order to make preferences studyable. But there are kind of fundamental debates on it. An
example is, in a language model response you have different things you care about, whether it's accuracy or style, and when you're collecting the data they all get compressed into "I like this more than another." That is happening, and there's a lot of research in other areas of the world that goes into how you should actually do this. I think social choice theory is the subfield of economics around how you should aggregate preferences. And I went to a workshop that published a white paper, like,
can you think about using social choice theory for R_L_H_F_? So I mostly would want people that get excited about the math to come and have things they can stumble into, and learn this kind of broader context. There's a fun thing: I keep a list of all the tech reports that I like of reasoning models. So in chapter fourteen, where there's a kind of short summary of R_L_V_R_, there's just a gigantic table where I list every single reasoning model that I like. I think in education, a lot of it at this point needs to be what I like,
because the language models are so good at the math. There's the famous paper, direct preference optimisation, which is a much simpler way of solving the problem than R_L_. The derivations in the appendix skip steps of math, and for this book I redid the derivations, and I'm like, what the heck is this log trick that they use to change the math? But doing it with language models, they're like, this is the log trick. And I don't know if I like this, that the math is so commoditized. I think some of the struggle in reading
this appendix and following the math, I think, is good for learning.
Both. Some of the providers are starting to work on models for education, which are designed to, actually I haven't used them, but I would guess they're designed to not give all the information at once and make people work for it. So I think you could train models to do this, and it would be a wonderful contribution, where all of this stuff in the book, you have to reevaluate every decision for it, which is such a great example. I think there's a chance we work on it at Ai2, which, I was like, oh, I think this would be so cool.
Mm-hmm.
Mm-hmm.
fully on board, but the problem here is I think it requires discipline. There are a lot of people who enjoy math, but there are also a lot of people who need to do it for their homework, and then it's like the shortcut. And yeah, we can develop an educational L_L_M_, but the other L_L_M_s are still there, and there's still a temptation to use the other L_L_M_s.
They understand the stuff they're passionate about, they're self-aware about it, and they understand it shouldn't be easy. I think we just have to develop good taste. We talk about research taste; similarly, learning taste: stuff that you should be struggling on and stuff you shouldn't be struggling on, which is tricky to know, 'cause sometimes you don't have good long-term vision about what would actually be useful to you in your career. But you have to develop that taste, yeah.
I was talking to maybe my fiancee or friends about this, and it's like there's this brief ten-year window where all of the homework and all the exams could be digital. Before that, everybody had to do all the exams in Bluebooks 'cause there was no other way, and now, after A_I_, everybody's gonna need to be in Bluebooks and oral exams 'cause everybody could cheat so easily. It's this brief generation that had a different education system, where everything could be digital but you still couldn't cheat, and now it's just gonna go back. It's just very funny.
You mentioned character training; just zooming out on a more general topic. For that topic, how much compute was required? And in general, to contribute as a researcher, are there places where not too much compute is required, where you can actually contribute as an individual researcher?
On the character training thing, that research was built on fine-tuning about seven-billion-parameter models with LoRa, which is essentially where you only fine-tune a small subset of the weights of the model. I don't know exactly how many G_P_U_ hours that would take.
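For readers who haven't seen it, the LoRa idea can be sketched in plain Python, with nested lists standing in for tensors; the shapes and numbers are purely illustrative:

```python
# LoRa sketch: the frozen weight matrix W is left untouched, and a
# low-rank update A @ B is trained and added on top, so only a small
# number of parameters (here 4*1 + 1*4 = 8 instead of 16) needs gradients.

def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ W + alpha * (x @ A @ B); W stays frozen, only A and B train."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [[base[i][j] + alpha * update[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]

W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # frozen 4x4
A = [[0.25], [0.25], [0.25], [0.25]]  # rank-1 adapter, 4x1
B = [[0.5, 0.0, 0.0, 0.0]]            # rank-1 adapter, 1x4
x = [[1.0, 2.0, 3.0, 4.0]]
y = lora_forward(x, W, A, B)
```

In a real setup, W comes from a pre-trained checkpoint and A and B are trained by gradient descent, which is why the G_P_U_ bill is a fraction of full fine-tuning.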
But it's doable.
not doable for every academic. The situation for some academics is so dire that the only work you can do is inference, where you have closed models or open models, you get completions from them, and you can look at them and understand the models. And that's very well suited to evaluation, where you want to be the best at creating representative problems that the models fail on or that show certain abilities. I think you can break through with this. I think the top-end goal for a researcher working on
evaluation, if you want to have career momentum, is that the frontier labs pick up your evaluation. You don't need to have every project do this. But if you go from a small university with no compute and you figure out something that Claude struggles with, and then the next Claude model has it in the blog post, there's your career rocket ship. I think that's hard, but if you wanna scope the maximum possible impact with minimum compute, it's something like that, which is: get very narrow, and it takes learning where the models are going. So you need to
build a tool that tests not where Claude four point five will fail. If I'm gonna start a research project, I need to think about where the models in eight months are gonna be struggling.
But what about developing totally novel ideas?
this is a trade-off. I think if you're doing a P_H_D_, you could also say it's too risky to work in language models, I'm going way longer term: what is the thing that's gonna define language model development in ten years? I end up being a person that's pretty practical. I went into my P_H_D_ like, oh, I got into Berkeley, worst case I get a master's and then I go work in tech. So I'm very practical about it: the life afforded to people who work at these A_I_ companies, the amount of money is wild.
Like, open A_I_'s average compensation is over a million dollars in stock a year per employee. For any normal person in the U_S_, getting into an A_I_ lab is transformative for your life. So I'm pretty practical: there's still a lot of upward mobility working in language models if you're focused, and the outcomes, look at these jobs. But from a research perspective, the transformative impact and the academic awards, like being the next Yann LeCun, come from not caring about language model development very much.
So I get to work with some awesome students, and they're like, should I go work in an A_I_ lab? And I'm like, you're getting a P_H_D_ at a top school, or you're gonna leave to go to a lab? If you go work at a top lab, I don't blame you. Don't go work at some random start-up that might go to zero. But if you're going to open A_I_, it could be worth leaving a P_H_D_ for.
Let's more rigorously think through this. Where would you recommend people go to make a research contribution? So the options are academia, so get a P_H_D_, spend five years publishing, compute resources are constrained. There's
there are research labs that are more focused on open-weight models, so working there. Or closed frontier research labs: open A_I_, Anthropic, X_A_I_, and so on.
Mm-hmm.
Mm-hmm.
Mm-hmm.
and trade-offs that in my opinion favour just taking the well-paying job with meaningful impact. So it's not only that you're getting paid to sit around at open A_I_; you're building the cutting edge of things that are changing millions of people's relationship to tech.
But there you're a cog in a machine.
I think, honestly, it hasn't changed that much. I have been in academia; I'm not in academia anymore. At the same time, I wouldn't wanna miss my time in academia. But what I wanted to say before I get to that part: I think it hasn't changed that much. I was using machine learning methods for applications in computational biology with collaborators, and a lot of people went from academia directly to Google, and I think it's the same thing back
then: professors were, you know, sad that their students went into industry, because they couldn't carry on their legacy in that sense. And I think it's the same thing. It hasn't changed that much, I think. The only thing that has changed is the scale. But, you know, cool stuff was always developed in industry that was closed; you couldn't talk about it. And I think the difference now is, well, your preference: do you like to talk about your work and publish, or are you more in a closed lab?
That's one difference, the compensation of course, but it's always been like that, I think. So it really depends on, you know, where you feel comfortable, and also nothing is forever. The only thing right now is there's a third option, which is starting a start-up. A lot of people are doing start-ups, a very risky move, but it can be a high-risk, high-reward type of situation, where joining an industry lab, I think, is pretty safe. You know, also upward mobility: honestly,
once you have been at an industry lab, it will be easier to find future jobs. But then again, you know, how much do you enjoy the team and working on proprietary things, versus how much do you like the publishing work? I mean, publishing is stressful; acceptance rates at conferences can be arbitrary, can be very frustrating, but also high reward. If you have a paper you're proud to publish, you feel good because your name is on there. It's a high.
And, you know, I feel like my friends who are professors seem on average happier than my friends who work at a frontier lab, to be totally honest. That's just grounding. And the frontier labs definitely do this nine nine six, which is essentially shorthand for work all the time.
Can you describe nine nine six, a culture that, I believe you could say, was invented in China and adopted in Silicon Valley? What's nine nine six? It's nine A_M_ to nine P_M_, six days a week. What is that, seventy-two hours? Okay. So is this basically the standard in A_I_ companies in Silicon Valley, more and more this kind of grind mindset?
Exactly like that. But I think there is a trend towards it. And it's interesting, I think it almost flipped, because when I was in academia, I felt like that, because as a professor you had to write grants, you had to teach, and you had to do your research. It's like three jobs in one, and it is more than a full-time job if you wanna be successful. And I feel like now, like Nathan just said, the professors, in comparison, maybe have even less pressure or workload than at a frontier lab, because
they work a lot, but they're just so fulfilled. Working with students, having a constant runway of mentorship and a mission that is very people-oriented, in an era when things are moving very fast and chaotically, is very rewarding to people.
have to make it, and it is really important that people put in the time. But it is really hard, because you have to deliver constantly. And I've been at a start-up; I had a good time, but I don't know if I could do it forever. It's an interesting pace, and it's exactly like we talked about in the beginning: these models are leapfrogging each other, and they are just constantly trying to take the next step compared to the competitors. It's just ruthless right now, I think.
I think this leapfrogging nature and having multiple players is actually an underrated driver of language modelling progress, where competition is so deeply ingrained in people, and these companies have intentionally created very strong cultures. Like, Anthropic is known to be culturally so deeply committed and organised. We hear so little from them, and everybody at Anthropic seems very aligned, and being at a culture that is super
tight and having this competitive dynamic, talk about a thing that's gonna make you work hard and create things that are better. So I think this comes at the cost of human capital: you can only do this for so long, and people are definitely burning out. I wrote a post on burnout; I've cycled in and out of this myself, especially trying to be a manager of full model training. It's a crazy job doing this. The book Apple in China by Patrick McGee, he talked about how hard the
Apple engineers worked to set up the supply chains in China, and he was like, they had marriage-saving programmes, and he told on a podcast here that people died from this level of working hard. So it's a perfect environment for creating progress at human expense, and a lot of the human expense is the nine nine six that we started this with: people really do grind.
I also read this book. I think they had a code word for if someone had to go home to spend time with their family to save the marriage. It's crazy. And colleagues understand, okay, this is like red alert, for this situation we have to let that person go home this weekend. But at the same time, I don't think they were forced to work. They were so passionate about the product, I guess, that you get into that mindset. And I had that sometimes as an academic, but also as an independent person I have that sometimes; I overwork, and it's unhealthy. I had
issues, I had neck issues, because I did not take the breaks that I maybe should have taken. But it's not because anyone forced me; it's because I wanted to work, because it's exciting stuff. Yeah.
I have this great fortune of having conversations with a wide variety of human beings, and from there I get to see all these bubbles and echo chambers across the world, and it's fascinating to see how we humans form them. And I think it's fair to say that Silicon Valley is a kind of echo chamber, a kind of silo and bubble. I think bubbles are actually really useful and effective. It's not necessarily a negative thing, 'cause it could be ultra-productive, it could be the
the Steve Jobs reality distortion field 'cause you just convince each other the breakthroughs are imminent and by convincing each other of that you make the breakthroughs imminent.
Mm-hmm.
Byrne Hobart wrote a book classifying bubbles. Essentially, one of them is financial bubbles, which is speculation, which is bad, and the other one is, I don't know the term, but effectively build-out bubbles, because they push people to build these things. And I do think A_I_ is in the second kind, but I worry about it transitioning to a financial bubble.
Yeah, but also in the space of ideas, that bubble, you are doing a reality distortion field, and that means you are deviating from reality. And if you go too far from reality, while also working, you know, nine nine six, you might miss some fundamental aspects of the human experience. And this is a common problem in Silicon Valley: it's a very specific geographic area; you might not understand the Midwest perspective,
the full experience of all the other different humans in the United States and across the world. And you speak a certain way to each other, you convince each other of a certain thing, and that can get you into real trouble. Whether A_I_ is a big success and becomes a powerful technology, or it's not, in either trajectory you can get yourself into trouble. So you have to consider all of that. Here you are, a young person trying to decide what you wanna do with your life.
The thing is, I don't even really understand this, but the S_F_ A_I_ memes have gotten to the point where "permanent underclass" was one of them, which was the idea that the last six months of twenty twenty five were the only time to build durable value in an A_I_ start-up or model; otherwise all the value will be captured by existing companies and you will therefore be poor. That's an example of the S_F_ thing that goes too far. I still think, for young people, that being able to tap into it, if you're
really passionate about wanting to have an impact in A_I_, being physically in S_F_ is the most likely place for you to do this, but it has trade-offs.
I think S_F_ is an incredible place, but there is a bit of a bubble. And if you go into that bubble, which is extremely valuable, also get out: read history books, read literature, visit other places in the world. Twitter and Substack are not the entire world.
I would say, one of the people I worked with is moving to S_F_, and I need to get 'em a copy of Season of the Witch, which is a history of S_F_ from like nineteen sixty to nineteen eighty-five. It goes through the hippie revolution, the gay community kind of taking over the city and that culture emerging, and then the H_I_V_ AIDS crisis and other things. That is so recent, and so much turmoil and hurt, but also love, in S_F_, and no one knows about this. The great
season of The Witch I recommend it. A bunch of my S_F_ friends were who do get out recommended it to me, and I think that it's just like living there like I lived there and I didn't appreciate this context and it's just like so recent.
Yeah. Okay, we talked a lot about a lot of things, certainly about what was exciting last year. But one of the things you guys mentioned as exciting this year is the scaling of text diffusion models, a different exploration of text diffusion. Can you talk about what that is and what possibility it holds, as a different kind of approach than the current L_L_M_s?
Yes, so we talked a lot about the transformer architecture, and the autoregressive transformer architecture specifically, like G_P_T_. That doesn't mean no one is working on anything else; people are always on the lookout for the next big thing, because it would be almost stupid not to. Sure, right now the transformer architecture is the thing, it works best and there's nothing else, but it's always a good idea not to put all your eggs in one basket, so people are developing alternatives to the
autoregressive transformer. One of them would be, for example, text diffusion models. Listeners may know diffusion models from image generation; Stable Diffusion popularised it, and there was a paper on generating images. Back then people used GANs, generative adversarial networks, and then there was this diffusion process where you iteratively denoise an image, which over time resulted in really good quality images. Stable Diffusion came from a company, other companies built their own diffusion models, and people are now like, okay, can we try
this also for text? It doesn't immediately make intuitive sense, because text is not something continuous like a pixel that we can differentiate; it's discrete, so how do we implement that denoising process? But it's kind of similar to the BERT models by Google. If you go back to the original transformer, there were the encoder and the decoder. The decoder is what we are using right now in G_P_T_ and so forth. The encoder is more like a parallel technique, where you have
multiple tokens that you fill in in parallel. G_P_T_ models are autoregressive: you complete the sentence one token at a time. In BERT models you have a text, say a sentence, that has gaps; you mask tokens out, and one iteration fills in these gaps. Text diffusion is kind of like that: you start with, let's say, some random text, and then you fill in the missing parts or refine them iteratively over multiple iterations. And the cool
thing here is that this can do multiple tokens at the same time, so it's kind of like the promise of being more efficient. The trade-off, of course, is how good the quality is. It might be faster, but now you have this dimension of the denoising process: the more steps you do, the better the text becomes. You can scale in different ways, and people try to see if this is maybe a valid alternative to the autoregressive model in terms of giving you
the same quality for less compute. Right now, I think there are papers that suggest that if you wanna get the same quality, you have to crank up the denoising steps, and then you end up spending the same compute you would spend on an autoregressive model. The other downside is that it's parallel, which sounds appealing, but some tasks are not parallel: reasoning tasks, or tool use, where you have to ask an interpreter to give you an intermediate result, and that is tricky with diffusion models. There are some hybrids, but
the main idea is: how can we parallelize it? So it's an interesting avenue. Right now there are mostly research models out there, like LLaDA and some others; I saw some deployed models by start-ups. There is no big diffusion model at scale yet, nothing at the Gemini or ChatGPT level. But there was an announcement by Google where they said they are launching Gemini Diffusion, and they put it into the context of their, I think, Nano two model.
They said that basically, for the same quality on most benchmarks, they can generate things much faster. So, you mentioned what's next: I don't think the text diffusion model is gonna replace autoregressive models, but it will be something maybe for quick, cheap, at-scale tasks. Maybe the free tier in the future will be something like that.
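To make the parallel denoising idea concrete, here is a toy sketch, not any real model's algorithm: start from an all-masked sequence, and at each step commit the most confident fraction of guesses in parallel, while the rest stay masked for the next refinement pass. The `toy_predict` function is a random stand-in for a trained denoiser.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def toy_predict(tokens):
    # Stand-in for a trained denoiser: for every masked position,
    # return a (token, confidence) guess. Here it is just random.
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_generate(length=8, steps=8):
    tokens = [MASK] * length          # start from pure "noise" (all masks)
    for _ in range(steps):
        guesses = toy_predict(tokens)
        if not guesses:
            break                     # nothing left to denoise
        # Commit the most confident half of the guesses in parallel;
        # the rest stay masked and get refined in the next iteration.
        k = max(1, len(guesses) // 2)
        best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

print(diffusion_generate())  # 8 tokens, several filled per step, none masked
```

The autoregressive analogue would be one forward pass per token; here each pass fills several positions, which is where the speed-up comes from.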
I think there's a couple of examples where I've heard it's actually started to be used. To paint a picture of why this is so much better: when G_P_T_ five is taking thirty minutes to respond, it's generating one token at a time, and this diffusion idea essentially generates all of the tokens in the completion in one batch, which is why it could be way faster. The start-ups I'm hearing about are code start-ups, where you have a code base and somebody who's effectively vibe coding says make
this change. A code diff is essentially a huge reply from the model, but it doesn't have to have that much external context, and you can get it really fast by using these diffusion models. So one example I've heard is using text diffusion to generate really long diffs, because doing it with an autoregressive model would take minutes, and that latency, for a user-facing product, causes a lot of churn; every second you lose a lot of users. So I think this is gonna grow and have some applications. But I
actually thought that different types of models were gonna be used for different things sooner than they have been, so my prediction was a kind of trade-off. I think the tool-use point is the one that's stopping them from being general purpose, because Claude Code and ChatGPT have to be interleaved with search: the autoregressive chain is interrupted by some external tool, and I don't know how to do that with the diffusion set-up.
So what's the future of tool use, this year and in the coming years? Do you think there's gonna be a lot of development there? How is that integrated into the entire stack?
Right now it's mostly on the proprietary L_L_M_ side, but I think we will see more of it in open-source tooling, and it is a huge unlock, because then you can really outsource certain tasks from memorization to actual tools: instead of having the L_L_M_ memorize what twenty-three plus five is, just use a calculator.
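As a sketch of that control flow, with a made-up tool-call format rather than any particular vendor's A_P_I_: the model either returns plain text, or emits a structured call that the harness executes and feeds back into the context.

```python
import json
import operator

# Hypothetical tool registry; real systems use richer schemas
# (function-calling APIs, MCP, etc.), but the loop is the same idea.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def calculator(op, a, b):
    return OPS[op](a, b)

TOOLS = {"calculator": calculator}

def run_turn(model_output):
    # If the model emitted a JSON tool call, execute it and return the
    # result (which would be appended to the context for the next turn);
    # otherwise the output is a plain-text answer and passes through.
    try:
        msg = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output
    result = TOOLS[msg["tool"]](**msg["args"])
    return f"tool result: {result}"

# Instead of memorizing twenty-three plus five, the model calls the tool:
print(run_turn('{"tool": "calculator", "args": {"op": "add", "a": 23, "b": 5}}'))
# → tool result: 28
print(run_turn("The capital of France is Paris."))
```

The harness, not the model, does the arithmetic; the model only has to learn when a call is warranted.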
So do you think that can help solve hallucination?
Not solve it, but reduce it. The L_L_M_ needs to know when to ask for a tool call. And the second thing is, the internet is not always correct. You can do a web search, but say I ask who won the World Cup in nineteen ninety-eight: the model still needs to find the right website and get the right information. It can still go to an incorrect website and give me incorrect information. So I don't think it will fully solve hallucination, but it is improving it.
Another cool paper earlier this year, I think it was December thirty-first, so not technically twenty twenty-six but close, was the recursive language model paper. That's a cool idea that takes this even a bit further. Just to explain: Nathan, you also mentioned earlier that it's harder to do cool research in academia because of the compute budget; if I recall correctly, they did everything with G_P_T_ five, so they didn't even use local models. But the idea is, let's say you have a long-context task:
instead of having the L_L_M_ solve all of it in one shot, or even in a chain, you break it down into sub-tasks. You have the L_L_M_ decide what a good sub-task is and then recursively call an L_L_M_ to solve it, and maybe also add tools. Say you have a huge Q_ and A_ task: each sub-call goes to the web and gathers information, and at the end you pull it together and stitch it back together. I think there's gonna be a lot
of unlock in things like that, where you don't necessarily improve the L_L_M_ itself; you improve how the L_L_M_ is used and what the L_L_M_ can use. One downside right now with tool use is that you have to give the L_L_M_ permission to use tools, and that will take some trust, especially if you wanna unlock things like having an L_L_M_ answer emails for you. Or not even answer, but just sort them or select them for you. I don't know if I would give an L_L_M_ access to my emails today; it seems
like a huge risk.
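A toy of the recursive idea just described; the `llm` function here is a stub (in the paper these would be actual model calls), and the splitting and aggregation are deliberately crude.

```python
def llm(prompt):
    # Stub "model call": answers a counting question over its own chunk.
    # In the recursive-LM set-up this would be a real L_L_M_ call.
    return str(prompt.count("fail"))

def recursive_answer(question, context, chunk_words=40):
    # Split the long context into sub-tasks, answer each with its own
    # call, then aggregate. Splitting on words keeps tokens intact.
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [llm(f"{question}\n---\n{c}") for c in chunks]
    # Aggregation step: a simple sum here; in general, one more call.
    return sum(int(p) for p in partials)

log = ("ok " * 30 + "fail " + "ok " * 40) * 3   # 3 failures buried in noise
print(recursive_answer("Count the bad lines.", log))  # → 3
```

No single call ever sees the full log; each sub-call works on a small window, which is the memory saving the paper is after.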
I think there's one cool last point on the tool-use thing. You hinted at this, and we've both come at it in our own ways: open versus closed models use tools in very different ways. With open models, people go to Hugging Face and download the model, and then each person decides what tool they want; I don't know, Exa is my preferred search provider, but somebody else might prefer a different search start-up. When you release a model, it needs to be useful for multiple tools and multiple use cases, which is really hard, because you're making a general
engine model, which is actually what G_P_T_-O_S_S_ is good for. With the closed models, you're deeply integrating the specific tool into your experience. And I think that open models will struggle to replicate some of the things I like to do with closed models, like referencing a mix of public and private information. Something I keep trying every three to six months is Codex on the web, which is just prompting a model to make an update to some GitHub repository that I have. And
that sort of secure cloud environment is just so nice: send it off to do this thing and then come back to me. These will probably help define some of the local, open, and closed niches, but initially, because there was such a rush to get tool use working, the open models were on the back foot. Which is kind of inevitable, given how many resources these frontier labs have. But it will be fun when the open models solve this, because it's gonna be
a bit more flexible and a potentially more interesting model that might work with this recursive idea, acting as an orchestrator and a tool-use model. So hopefully the necessity drives some interesting innovation there.
So, continual learning. This is a long-standing, important problem, and I think it increases in importance as the cost of training the models goes up. Can you explain what continual learning is and how important it might be, this year and in the coming years, to making progress?
This relates a lot to the S_F_ zeitgeist of what A_G_I_, artificial general intelligence, is, what A_S_I_, artificial super-intelligence, is, and what the language models we have today are capable of doing. I think language models can solve a lot of tasks, but a key milestone among the A_I_ community is essentially when A_I_ could replace any remote worker: taking in information, solving digital tasks, and doing them. The limitation highlighted by people is that
a language model will not learn from feedback the way an employee does. If you hire an editor, the editor will mess up, but you will tell them, and if you hired a good editor, they don't do it again. Language models don't have this ability to modify themselves and learn very quickly. So the idea is: if we're gonna actually get to a true, general, adaptable intelligence that can go into any remote-work scenario, it needs to be able to learn quickly from feedback, on-the-job learning.
Mm-hmm.
I'm personally more bullish on language models being able to do this if you just provide them with very good context. Like you maybe said off-line, you can write extensive documents for models where you say: I have all this information, here are all the blog posts I've ever written, I like this type of writing, my voice is based on this. But a lot of people don't provide this to models, and the models weren't previously designed to take this amount of context; the agentic models are just starting to. So it's this trade-off of: do we need to update
the weights of this model with this continual-learning thing to make it learn fast? Or, the counter-argument: we just need to provide more context and information, and the models will have the appearance of learning fast by having a lot of context and being very smart.
We should mention the terminology here. Continual learning refers to changing the weights continuously so that the model adapts and adjusts based on new incoming information, and does so continually, rapidly, and frequently. And the thing you mention on the other side would generally be referred to as in-context learning: as you learn stuff, there's a huge context window, and you can just keep loading
in extra information every time you prompt the system. Both can legitimately be seen as learning; it's just a different place where you're doing the learning.
In terms of updating weights, we already have that in different flavours. I think the distinction here is: do you do it on a personalised, custom model for each person, or on a global model? And we have the latter already, going from G_P_T_ five to five point one and five point two. It's maybe not immediate, but it is a curated update, a quick curated update: there was feedback from the community about things it couldn't do, they updated the weights, next model, and
so forth. So it is kind of a flavour of that. An even finer-grained example is R_L_V_R_: you run it, it updates. The problem is you can't just do that for each person, because it would be too expensive to update the weights for everyone, and I think that's the real problem. Even at OpenAI scale, building big data centres, it would be too expensive. It only becomes feasible once you have something on-device, where the cost is on the user,
like what Apple tried to do with the Apple Foundation models: putting them on the phone and then having them learn from experience.
A related topic, and maybe an anthropomorphizing term, but: memory. What are the different ideas for mechanisms to add memory to these systems, personalised memory especially?
Right now it's mostly context, basically: stuffing things into the context and then recalling them. But it's expensive, because, I mean, you can cache it, but you still spend tokens on it. And you can only do so much; it's more for a preference or a style. A lot of people do that when they solve math problems: there are ways you can add previous knowledge, and you also give it certain preferences in the
prompt, like do what I preferred last time, something like that. But it doesn't unlock new capabilities. For that, one thing people do still use is LoRA adapters. Instead of updating the whole weight matrix, these are two smaller weight matrices that you overlay in parallel, like a delta. You can do that to some extent, but then again, it's economics. There was also a paper,
"LoRA learns less and forgets less." There's no free lunch: if you wanna learn more, you need to use more weights, but that gets more expensive; and if you learn more, you forget more. You have to find that Goldilocks zone, basically.
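The adapter idea can be shown with a tiny pure-Python sketch; real implementations also scale the delta and train A and B with gradients, which is omitted here. The frozen weight W gets a low-rank delta A·B overlaid on it, so only a fraction of the parameters are trained.

```python
import random

random.seed(0)

def matmul(A, B):
    # Plain-Python matrix multiply, enough for this toy.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d, r = 4, 1   # model dimension vs. LoRA rank (r << d in practice)

# Frozen pretrained weight W: fine-tuning it directly trains d*d numbers.
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# LoRA trains only A (d x r) and B (r x d): 2*d*r numbers instead.
# B starts at zero, so the delta, and the behaviour change, starts at zero.
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]
B = [[0.0] * d for _ in range(r)]

delta = matmul(A, B)                  # the low-rank "overlay"
W_adapted = [[w + dw for w, dw in zip(wr, dr)]
             for wr, dr in zip(W, delta)]

print(W_adapted == W)                 # True: delta is zero before training
print(f"full fine-tune params: {d * d}, LoRA params: {2 * d * r}")
```

At a realistic scale, say d in the thousands and r of eight or sixteen, the parameter saving is what makes per-user adapters economically plausible.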
We haven't really mentioned it much, but implied in this discussion is context length. Is there a lot of innovation possible there?
I think the colloquially accepted answer is that it's a compute and data problem, plus sometimes small architecture things, like attention variants. We talked about hybrid attention models, which essentially have what looks like a state-space model inside the transformer, and those are better suited because you spend less compute to model the furthest-along token. But those aren't free, because they have to be
backed by a lot of compute or the right data. How many sequences of a hundred thousand tokens do you have in the world, and where do you get them? It just ends up being pretty expensive to scale. We've gotten pretty quickly to about a million tokens of input context length, and I would expect it to keep increasing, to maybe two million or five million this year. But I don't expect it to go to a hundred million; that would be a true breakthrough. And I think those breakthroughs are possible, like the continual learning thing: I
think of it as a research problem where there could be a breakthrough that just makes transformers work way better at this, cheaply. These things could happen with so much scientific attention on them, but by turning the crank it'll be consistent increases over time.
I think, also looking at the extremes, there's again no free lunch. At one extreme, to make it cheap, you have, let's say, an R_N_N_ that has a single state where you save everything from the previous steps; it's a specific, fixed-size thing, so the memory never really grows, because you are stuffing everything into one state. But the longer the context gets, the more information you forget, because you can't compress everything into one state. At the other end you have the transformers, which try to remember every token,
which is great when we want to look up specific information, but very expensive, because the K_V_ cache grows and the dot products grow. Like you said, the Mamba layers kind of have the same problem as an R_N_N_: you try to compress everything into one state. You're a bit more selective there, but it's this Goldilocks zone again. With Nemotron they found a good ratio of how many attention layers you need for the global information, where everything is accessible, compared to these compressed states.
I think that's how we will scale more: by finding better ratios in that Goldilocks zone, between making it cheap enough to run and powerful enough to be useful. And one more plug here: the recursive language model paper is one of the papers that tries to address the long-context thing. What they found is essentially that instead of stuffing everything into one long context, if you break it up into
multiple smaller tasks, you save memory by having multiple smaller calls, and you can actually get better accuracy than having the L_L_M_ try everything all at once. It's a new paradigm; we will see, there might be other flavours of it. So I think we will still make improvements on long context, but also, like Nathan said, the problem is that for pre-training itself we don't have as many long-context documents as other documents, so it's harder to study
how L_L_M_s behave at that level.
There are some rules of thumb: essentially you pre-train a language model at, say, eight K_ context length and then extend it to thirty-two K_ with training, and doubling the training context length takes roughly two X_ compute, and then you can normally two-to-four X_ the context length again. So a lot of it ends up being compute-bound at pre-training, which links to what we talked about: everyone talks about this big increase in compute for the top labs this year, and that should be reflected in longer
context lengths. But on the post-training side there are more interesting things, which is that as we have agents, the agents are gonna manage this context on their own. People who use Claude Code a lot dread the compaction, which is when Claude takes its entire hundred thousand tokens of work and compacts it into a bulleted list. But what the next models will do, and this is not novel, I'm sure people are already working on it, is essentially let the model control when it compacts and how. You can train your R_L_ algorithm where compaction is an action that shortens
the history, and the problem formulation is: I want to keep the maximum evaluation scores I've gotten while the model compacts its history to the minimum length, because then you have the minimum number of tokens needed for this kind of compounding autoregressive prediction. So there are actually some pretty nice problem set-ups here, where these agentic models learn to use their context in a different way than just plowing forward.
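A minimal sketch of compaction as an explicit action in the agent loop; the summary below is a crude keep-first-and-recent truncation standing in for a model-written summary, and an R_L_ objective would reward keeping task scores high while minimizing the retained tokens.

```python
BUDGET = 12                       # max history entries to carry forward

def compact(history):
    # Stand-in summary: keep the opening entry plus the most recent work.
    return [history[0], "<summary>"] + history[-3:]

def step(history, new_entries):
    history = history + new_entries
    if len(history) > BUDGET:     # the "compaction action" fires
        history = compact(history)
    return history

h = []
for turn in range(5):             # each turn appends 4 entries of "work"
    h = step(h, [f"t{turn}.{i}" for i in range(4)])
print(len(h), h)                  # stays small instead of growing to 20
```

The training question is then when to fire the action and what the summary should keep, which is where the learned policy replaces this fixed rule.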
One interesting recent example would be DeepSeek version three point two, where they had the sparse attention mechanism: a very efficient, small, lightweight indexer, and instead of attending to all the tokens, it selects which tokens it actually needs. It almost comes back to the original idea of attention, where you are selective; but with regular attention you always use all the tokens, maybe with near-zero weight on some of them, whereas here it's: let's just mask that out, or not even compute it. And
sliding-window attention is also kind of like that idea. You have a rolling window that you keep fixed, because you don't need everything all the time. Occasionally, in some layers, you might, but it's wasteful. Right now, if you use everything, you're on the safe side; it gives you the best quality because you never miss information. And I think this year will also be the year of figuring out, like you said, how to be smarter about that. Right now people wanna have the next state of the art, and the state of the art happens to be the
brute-force expensive thing. And once you have that, like you said, you keep that accuracy but see how you can do it cheaper, with tricks.
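A toy mask for the sliding-window idea mentioned above: each query attends only to itself and the previous tokens inside the window, so the number of scored pairs grows with n times the window size instead of n squared.

```python
def sliding_window_mask(seq_len, window):
    # mask[i][j] == 1 means query i may attend to key j: causal, and
    # restricted to a rolling window of the most recent tokens.
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print(row)

full = sum(i + 1 for i in range(6))          # full causal: 21 pairs scored
windowed = sum(sum(row) for row in mask)     # windowed: 15 pairs
print(full, windowed)
```

The gap between the two counts widens quickly with sequence length, which is why mixing a few full-attention layers with many windowed ones is the ratio game described above.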
Yeah, it's all this scaling thing. The reason we get the Claude four point five Sonnet model first is because you can train it faster; you're not hitting the compute walls as soon, so they can try a lot more things and ship the model faster, even though the bigger model is actually better.
I think we should say that there's a lot of exciting stuff going on in the A_I_ space. My mind has recently been really focused on robotics, and today we almost entirely didn't talk about robotics. There's also a lot of stuff in image generation and video generation. I think it's fair to say that the most exciting research work, in terms of sheer intensity and fervor, is in the L_L_M_ space, which is why I think it's justified for us to really focus on the
L_L_M_ topics we're discussing. But it'd be nice to bring in some things that might be useful. For example, world models; there's growing excitement about those. Do you think there'll be any use for world models in the L_L_M_ space this coming year?
Yes, I do think so. An interesting thing with L_L_M_s is that if we unlock more L_L_M_ capabilities, it also automatically makes progress faster in all the other fields, because a lot of researchers and engineers use L_L_M_s, as we said, for coding. So even if they work on robotics, if you optimize the L_L_M_s that help with coding, it pays off. But then, yes, world models are interesting. It's basically where you have the model
run a simulation of the world, a little toy version of the real thing, which can again unlock capabilities the L_L_M_ is not aware of: it can simulate things. L_L_M_s just happen to work well through pre-training and next-token prediction, but we could make this a bit more sophisticated. So what I'm saying is, there's a
paper, Code World Models, where they basically apply the concept of world models to L_L_M_s: instead of just having next-token prediction and verifiable rewards checking the answer's correctness, they also make sure the intermediate variables are correct. It's kind of like the model is learning a code environment, in a sense. I think this makes a lot of sense; it's just expensive to do, but it makes things more sophisticated,
modelling the whole thing, not just the result, so it can add more value. I remember when I was a grad student there was a competition called CASP, I think, where they do protein structure prediction: predicting the structure of a protein that is not solved yet at that point. In that sense it's actually great, and I think we need something like that for L_L_M_s,
where you do the benchmark but hand in results without knowing the solution, and only after the fact someone reveals it. AlphaFold, when it came out, crushed this benchmark. There were multiple iterations, but I remember that the first one, and I'm not an expert in that subfield, explicitly modelled the physical interactions, the physics of the molecule, like which angles are impossible. And then in the
next version, I think, they got rid of this and just brute-force scaled it up. With L_L_M_s we are currently in this brute-force scaling phase because it just happens to work. But I do think at some point it might make sense to bring this kind of modelling back, and with world models I think that might actually be quite cool. And of course also for robotics, which is a completely different area from L_L_M_s.
Yeah, and in robotics it's very explicit. There's the problem of locomotion and the problem of manipulation. Locomotion is much more solved, especially in the learning domain. But there's a lot of value, just like with the initial protein-folding systems, in bringing in traditional model-based methods. It's unlikely that you can just learn the manipulation, or the whole-body loco-manipulation problem, end to end. That's the dream, but then when you look at the magic of the human hand
and the complexity of the real world, you realise it's really hard to learn this all the way through, in a way that, I guess, AlphaFold two didn't.
I'm excited about the robot-learning space, though. I think it's collectively getting supercharged by all the excitement and investment in language models generally. The infrastructure for training transformers, which is a general modelling thing, is becoming world-class industrial tooling, so wherever that was a limitation for robotics, it's just way better now, whether in tooling or compute. And on top of that, they take these language models and use them as central units where you can do interesting exploration
around something that already kind of works. I see it emerging kind of like what we talked about with Hugging Face Transformers. When I was at Hugging Face I was trying to get this to happen, but it was too early: open robotics models on Hugging Face, with people able to contribute data and fine-tune them. I think we're much closer now with the investment in robotics, and I think self-driving cars are related and enable this: once you get to the point where you have this sort of ecosystem where somebody can download a
robotics model and maybe fine-tune it to their robot, or share data sets across the world. There's some work in this area, like R_T_-X_, I think from a few years ago, where people were trying to do that. But once they have this ecosystem, it'll look very different, and this whole post-ChatGPT boom is putting more resources into it, which I think makes it a very good area for doing research.
This is also resulting in much better, more accurate, more realistic simulators being built, closing the sim-to-real gap in the robotics space. But you mentioned a lot of excitement and a lot of investment in the robotics space. The downside of that, which happens in hype cycles: I personally believe, and most robotics people believe, that robotics is not going to be solved on the time scale that is being implicitly or explicitly
promised. So what happens when all these robotics companies spring up and then don't have a product that works? Then there's going to be this crash of excitement, which is nerve-racking. Hopefully something else will come swooping in so that the continued development of some of these ideas keeps going.
It's also related to the continual learning issue, essentially, because the real world is so complex. With L_L_M_s you don't really need something learned per user, because there are a lot of things everyone has to do; everyone maybe wants to, I don't know, fix the grammar in their email, or code, or something like that. It's more constrained, so you can prepare the model for it. But preparing the robot for the real world is harder. I mean, you have the robotic foundation models, and
you can learn certain things, like grasping, but everyone's house is different. That is where the robot would have to learn on the job, essentially, and that, I guess, is the bottleneck right now: how to customise it on the fly.
I don't think I can possibly overstate the importance of the thing that almost never gets talked about by robotics folks, or anyone: safety. All the interesting complexities we talk about in learning, all the failure modes and failure cases, everything we've been discussing with L_L_M_s sometimes failing in interesting ways, all of that is fun and games in the L_L_M_ space. In the robotics space, in people's homes, across millions of minutes,
billions of interactions, you are almost never allowed to fail. When you have embodied systems out in the real world, you have to solve so many problems you never thought you'd have to solve when you were just thinking about the general robot-learning problem.
And so I'm bearish on in-home learned robots for consumer purchase. I'm very bullish on self-driving cars, and I'm very bullish on robotic automation, e.g. Amazon distribution, where Amazon has built whole new distribution centres designed for robots first rather than humans.
Mm-hmm.
The path to robots doing that is more reasonable, where it's a thing that is designed and optimised to do a repetitive task that a human could conceivably do but doesn't want to. But it's also gonna take a lot longer than people probably predict. I think the leap from A_I_ singularity to "we can now scale up mass manufacturing in the U_S_ because we have a massive A_I_ advantage" is one that is
troubled by a lot of political and other challenging problems.
Let's talk about timelines. Uh specifically timelines to A_G_I_ or A_S_I_.
Is it fair like as a starting point to say that nobody really agrees on the definitions of A_G_I_ and A_S_I_?
I kind of think there's a lot of disagreement, but I've been getting pushback where a lot of people kind of say the same thing, which is: a thing that could reproduce most digital economic work. So the remote worker is a fairly reasonable example, and I think open A_I_'s definition is somewhat related to that, which is an A_I_ that can do a certain number of economically valuable tasks, which I don't really love as a definition, but I think it could be a grounding point, because
language models today, while immensely powerful, are not this drop-in remote worker, and there are things you could think of that could be done by an A_I_ that are way harder than remote work, like finding an unexpected scientific discovery that you couldn't even posit, which would be an example of something somebody would call an artificial super-intelligence problem, or taking in all medical records and finding linkages across certain illnesses that we
didn't know about, or figuring out that some common drug can treat some niche cancer. They would say that that is a super-intelligence thing. So these are kind of natural tiers. My problem with it is that it becomes deeply entwined with the quest for meaning of A_I_ and these religious aspects to it. So there's different paths you can take it.
And I don't even know if the remote worker is a good definition, 'cause what exactly is that? It's like perfect tool use. I actually, I mean, I don't know if you like the A_I_ twenty twenty-seven report. They focus more on code and research taste. So the target there is the superhuman coder. They have several milestone systems: superhuman coder, superhuman A_I_ researcher, then super-intelligent A_I_ researcher, and the full
artificial super-intelligence. But after you develop the superhuman coder, everything else falls quickly. There the task is to fully automate coding. So any kind of coding you need to do in order to perform research is fully automated. And from there, humans would be doing A_I_ research together with that system, and they will quickly be able to develop a system that
can actually do the research for you. That's the idea. And initially their prediction was twenty twenty-seven, twenty twenty-eight, and now they've pushed it back by three to four years, to a twenty thirty-one mean prediction. My prediction is probably even beyond twenty thirty-one, but at least you can, in a concrete way, think about how difficult it is to fully automate programming.
Yeah, I disagree with some of their presumptions and dynamics on how it would play out. But I think they did good work in the scenario, defining milestones that are concrete, and telling a useful story, which is why the reach of this A_I_ twenty twenty-seven document well transcended Silicon Valley: they told a good story and they did a lot of rigorous work. The camp I fall into is that A_I_ is so-called jagged, where it will be excellent at some things and really bad at others. So I think
when they're close to this automated software engineer, what it will be good at is traditional M_L_ systems and front-end, which the models are excellent at, but distributed M_L_ the models are actually really quite bad at, 'cause there's so little training data on doing large-scale distributed learning. And this is something we already see, and I think those gaps just get amplified. And then it's kind of messier in these trade-offs, and then there's how you think A_I_ research works, and so on.
So you think, basically, the superhuman coder is almost unachievable? Meaning, because of the jagged nature of the thing, you're just always going to have gaps in capabilities.
I think it's assigning completeness to something where the models are kind of superhuman at some types of code, and I think that will continue. And people are creative, so they'll utilise these incredible abilities to fill in the weaknesses of the models and move really fast. And it'll always kind of be, as I've perceived for a long time, this dance where the humans are enabling the thing the model can't do, and the best A_I_ researchers are the ones that can enable this superpower. And I think this aligns with what we already see. Like Claude Code for building:
you can stand up a beautiful website in a few hours or do data analysis. And it's gonna keep getting better at these things, and it'll pick up some new code skills along the way. And kind of linking to what's happening in big tech: this A_I_ twenty twenty-seven report leans into the singularity idea, where I think research is messy and social and largely in the data in ways that A_I_ models can't process. But what we do have today is really powerful, and
tech companies are all collectively buying into this with tens of billions of dollars of investment. So we are gonna get some much better version of chat G_P_T_, a much better version of Claude Code than we already have. I think it's just hard to predict where that is going, but the bright clarity of that future is why some of the most powerful people in the world are putting so much money into this. And I think it's just kind of small differences: we don't actually know what a better version of chat G_P_T_ is, but also,
can it automate A_I_ research? I would say probably not, at least in this time frame. Big tech is gonna spend a hundred billion dollars much faster than we get an automated A_I_ researcher that enables an A_I_ research singularity.
So your prediction would be what? Like, if this is even a useful milestone, we're more than ten years out?
I would say less than that on the software side, but I think longer than that on things like research.
Well, let's just, for fun, try to imagine a world where all software writing is fully automated. Can you imagine that world?
By the end of this year, the amount of software that'll be automated will be so high. But it'll be things like: you're trying to train a model with R_L_ and you need to have multiple bunches of G_P_U_s communicating with each other; that'll still be hard, but I think it'll be much easier.
One of the ways to think about this, the full automation of programming, is to think of the lines of useful code written, and the fraction of that to the number of humans in the loop. So presumably, for a long time there'll be humans in the loop of software writing; there will just be fewer and fewer relative to the amount of code written, right. And for the S_C_, the superhuman coder, I think the presumption there is that the number of humans in the loop goes to zero. What does that
world look like when the number of humans in the loop is in the hundreds, not in the hundreds of thousands?
I think software engineering will be driven more to system design and goals of outcomes. I do think this has been happening over the last few weeks, where people have gone from, a month ago, "oh yeah, agents are kind of slop," which is a famous Karpathy quote, to what is a little bit of a meme, the industrialization of software, when anyone can just create software at their fingertips. I do think we are closer to that side of things, and it takes direction
and understanding how the systems work to extract the best from the language models, and I think it's hard to accept the gravity of how much software development is gonna change, and how many more people can do things without ever looking at the code.
What's interesting is to think about whether these systems will be completely independent, in the sense that, well, I have no doubt that L_L_M_s will at some point solve coding in the sense that calculators solved calculating, right. At some point humans developed the tool where you never need a human to calculate that number: you just type it in, it's an algorithm. And I think that's probably the same for coding. But the question is, I think what will happen is, you will just say build
that website, it will make a really good website, and then you maybe refine it. But will it do things independently? Will you still have humans asking the A_I_ to do something? Like, will there be a person saying, build that website? Or will there be A_I_ that just builds websites, or something? Or whatever.
I think with building websites, the problem with websites and the problem with the web, you know, H_T_M_L_ and all that kind of stuff, is that it's very resilient to slop. It'll render slop just as readily as anything else. I would rather think of safety-critical systems, like asking A_I_ to end-to-end generate something that manages logistics, or manages cars and fleets of
cars, all that kind of stuff. End-to-end generate stuff for you.
I think a more intermediate example is to take something like Slack or Microsoft Word. If the organisations allow it, A_I_ could very easily implement features end to end and do a fairly good job for things that you want to try. You wanna add a new tab in Slack that you want to use, and I think A_I_ will be able to do that pretty well.
Actually that's a really great example. How far away are we from that?
Like this year.
See, I don't know. I don't know.
how bad production code bases are, but I think that within, on the order of low years, a lot of people are gonna be pushed to be more of a designer and product manager, where you have multiple of these agents that can try things for you, and they might take one to two days to implement a feature or attempt to fix a bug, and you have these dashboards. I think Slack is actually a good dashboard, where your agents will talk to you and you'll then give feedback. But things like, make me a website: you wanna make a logo.
I think these cohesive design things and this style are gonna be very hard for models, and deciding what to add next.
Okay, so I hang out with a lot of programmers, and some of them are a little bit on the skeptical side in general. Just vibe-wise, they're like that.
I just think there's a lot of complexity involved in adding features to complex systems. Like if you look at the browser, Chrome,
Mm-hmm.
if I wanted to add a feature, if I wanted to have tabs on the left side as opposed to up top, in the interface, right. I think that's not a next-year thing.
One of the Claude releases this year, one of their tests was: we give it a piece of software and leave Claude to run to re-create it entirely. And it could already almost rebuild something like Slack from scratch, just given the parameters of the software and left in a sandbox environment to do that.
Mm-hmm. So it might be that the smaller, newer companies are advantaged. They're like: we don't have to have the bloat and complexity, and therefore this future exists.
It's a specification issue. So with programming, you're just assuming, this is like a communication issue, as in relationships and friendships. You're assuming the L_L_M_ somehow is supposed to read your mind. I think this is where spec-driven design is really important. You just, using natural language, specify what you want.
I think, if you've talked to people at the labs, they use these in their training and production code. Claude Code is built with Claude Code, and they all use these things extensively, and Dario talks about how much of Claude's code is written by Claude. And these people are slightly ahead in terms of the capabilities they have, and on inference they could spend ten to a hundred-plus X_ as much as we're spending. We're on a lowly hundred- or two-hundred-dollar-a-month plan. They truly let it rip. And I think that,
with the pace of progress that we have, where a year ago we didn't have Claude Code and we didn't really have reasoning models, it's like the difference between sitting here today and what we can do with these models, and it seems like there's a lot of low-hanging fruit to improve them. The failure modes are pretty dumb. It's like: Claude, you tried to use a C_L_I_ command that isn't installed fourteen times, and then I sent you the command to run. That thing, from a modelling perspective, is pretty
fixable. So, uh
I agree with you. I've been becoming more and more bullish in general. Speaking to what you're articulating, I think it is a human skill issue. So Anthropic, or other companies, are leading the way in understanding how to best use the models for programming, and therefore they're effectively using them. I think there are a lot of programmers on the outskirts; there's not a really good guide on how to use them. People are trying to figure it out.
It might be very expensive. It might be that the entry point for that is two thousand dollars a month, which is only tech companies and rich people. That could just be it.
But it might be worth it. I mean, if the final result is a working software system, well, it might be worth it. By the way, it's funny how we converged from the discussion of timelines to A_G_I_ to something more pragmatic and useful. Is there anything concrete and interesting and useful and profound to be said about timelines to A_G_I_ and A_S_I_? Or are these discussions a bit too detached from the day-to-day?
There are interesting bets. So there are a lot of people trying to do reinforcement learning with verifiable rewards, but in real scientific domains, where there are start-ups that have hundreds of millions of dollars of funding and have wet labs where they're having language models propose hypotheses that are tested in the real world. And I would say they're early, but with the pace of progress, maybe they're early by six months and they make it because they were there first, or maybe they're early by eight years, so you don't really know. So I think that type of moonshot,
to branch this momentum into other sciences, is like, okay, that would be very transformative if alpha-fold moments happen in all sorts of other scientific domains by a startup solving this. I think there are startups, maybe Harmonic is one, where they're going all in on language models plus Lean for math. I think you had another podcast guest where you talked about this recently, and it's like, we don't know exactly how it's gonna fall out of spending a
million dollars on that model. And most of them will fail, but a couple of them might be big breakthroughs that are very different than chat G_P_T_ or Claude Code type software experiences. Like a tool that's only good for a P_H_D_ mathematician, but makes them a hundred X_ more effective.
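The "language models plus Lean" bet is concrete enough to sketch: the model proposes a proof term or tactic script, and Lean's kernel mechanically accepts or rejects it, which is exactly a verifiable reward. A toy illustration in Lean 4 (the theorem and proof here are our own example, not taken from Harmonic or any actual system):

```lean
-- A candidate proof a language model might emit. If the term fails to
-- type-check, Lean rejects it (reward 0); if it elaborates, the proof
-- is certified correct (reward 1), with no human grading needed.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The design point is that the checker, not a learned reward model, supplies the training signal, which is why math is a natural fit for this style of R_L_.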
Okay, I agree. I think this will happen in a lot of domains, especially domains that have a lot of resources, like finance and legal and pharmaceutical companies. But then again, is it really A_G_I_, because we are now specialising it again? And is it really that much different from back in the day, when we had specialised algorithms? I think it's the same thing, just way more sophisticated, but I dunno, is there a threshold when we call it A_G_I_? I guess I think
the real cool thing here is that we have the foundation models that we can specialise. I think that's the breakthrough at some point. Right now, we are not there yet, because, well, first it's too expensive, but also, you know, OpenAI doesn't just give away the model for you to customise it. I think once that's true in some way, and I can imagine this as a business model, OpenAI may say at some point: hey, you know, Bank of America, for a hundred million we will do your custom model, or something like that. And I think that will be a
huge economic value-add. The other thing, though, is also companies. I mean, right now, what is the differentiating factor? If everyone uses the same L_L_M_, if everyone uses G_P_T_, they will all do the same thing again. Then everyone is moving in lockstep, but usually companies want to have a competitive advantage, and I think there's no way around using some of their private data and experimenting and maybe specialising. It's gonna be interesting, yeah.
Sitting in the pace of progress, it does just feel like things are coming. I don't think the A_G_I_ and A_S_I_ thresholds are particularly useful.
I think, I guess, the real question, and this takes us to the remote worker thing, is: when are we going to see a big, obvious leap in economic impact? 'Cause currently there's not been an obvious leap in the economic impact of L_L_M_s, for example. And, you know, aside from A_G_I_ or A_S_I_ or all that kind of stuff, there's a real question of when are we gonna see a G_D_P_ bump.
Mm-hmm mm-hmm.
Yeah, it's like, what is the G_D_P_ made up of? A lot of it is financial services, so I don't know what this is. It's just hard for me to think about the G_D_P_ bump. But I'd say that software development becomes valuable in a different way when you no longer have to look at the code anymore. So when it's like, Claude'll make you a small business, which is essentially Claude can set up your website, your bank account, your email, and whatever else, and you just have to express
what you're trying to put into the world. That's not just an enterprise market, but it is hard; I don't know how you get people to try doing that. I guess if chat G_P_T_ can do it, people are trying chat G_P_T_.
I think it boils down to the scientific question of how hard tool use is to solve.
A lot of the stuff you're implying, the remote work stuff, is tool use. It's computer use: how you have an L_L_M_ that goes out there, this agentic system, and does something in the world and only screws up one percent of the time.
Computer use is a good example of something labs care about and we haven't seen a lot of progress on. We saw multiple demos in twenty twenty-five, like Claude can use your computer, or OpenAI had C_U_A_, and they all suck. So they're also investing money in this and they think that'll be a good example. That's actually something where it just seems like a hard environment
for the model to work in. They're not working on your MacBook. They are individually interfacing with Google and Amazon and Slack, and they handle all these things in a very different way than humans do. So some of those might be structural blockers.
Also, specification-wise, I think the problem is, for arbitrary tasks, well, you still have to specify what you want your L_L_M_ to do, and how do you do that? What is the environment? You can say what the end goal is, but what if it can't solve the end goal? With L_L_M_s, if you ask for text, you can always clarify, do sub-steps. But how do you put that information into a system that, say, books a travel trip for you? You can say, well, you screwed up my credit card information, but
even to get it to that point, how do you, as a user, guide the model before it can even attempt that? I think the interface is really hard.
Yeah, it has to learn a lot about you specifically. This goes back to continual learning: learning about the general mistakes that are made throughout, and then the mistakes that are made by you.
Yeah.
Mm-hmm.
Mm-hmm.
Engagement. Some people really like this Pulse feature, which processes your chats and automatically searches for information and puts it in the chat G_P_T_ app. So there's a lot of things coming for that.
I used that feature before, and I always feel bad because it does that every day and I rarely check it out. It's like, how much compute is burned on something I don't even look at, you know. It's kind of like, yeah, sure, okay.
New ideas might be needed. Is it possible that the path to A_G_I_, whatever that is, however defined, to solve computer use more generally, to solve biology and chemistry and physics, sort of the Dario definition of A_G_I_ or powerful A_I_, do you think it's possible that totally new ideas are needed? Non-L_L_M_, non-R_L_ ideas. What might they look like? Now we're going
into philosophy land a little bit.
For something like a singularity to happen, I would say yes. And the new ideas could be architectures or training algorithms, which are fundamental deep learning things. But those are by nature pretty hard to predict, and I think we will get very far even without those advances. We might get this software solution, but it might stop at software and not do computer use without more innovation. So a lot of progress will be coming, but if you zoom out, there are
still ideas in the next thirty years that are gonna look like: that was a major scientific innovation that enabled the next chapter of this. And I don't know if it comes in one year or in fifteen years.
Yeah, well I wonder if the bitter lesson holds true for the next hundred years, what that looks like.
If scaling laws are fundamental in deep learning, I think the bitter lesson will always apply, which is that compute will become more abundant. But even with abundant compute, the ones that have a steeper scaling-law slope or a better offset, this is a two-D_ plot of performance versus compute, even if there's more compute available, the ones that get a hundred X_ out of it will win.
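That slope-and-offset picture is just the usual power-law fit, loss(C) ≈ a·C^(−b) + floor. A quick sketch with invented coefficients (not from any real model family) shows why extra compute alone doesn't erase an algorithmic edge:

```python
# Toy illustration of the "steeper slope / better offset" point: two
# hypothetical training recipes following a power-law scaling fit
# loss(C) = a * C**(-b) + floor, evaluated at two compute budgets.
def loss(compute, a, b, floor):
    return a * compute ** (-b) + floor

# Invented coefficients: recipe B has a steeper slope (larger b),
# i.e. it converts each extra unit of compute into more improvement.
recipe_a = dict(a=10.0, b=0.05, floor=1.0)
recipe_b = dict(a=10.0, b=0.10, floor=1.0)

for c in (1e21, 1e23):  # second budget is 100x the first
    la, lb = loss(c, **recipe_a), loss(c, **recipe_b)
    print(f"compute={c:.0e}  A={la:.3f}  B={lb:.3f}")

# Both recipes improve with 100x compute, but B stays ahead at every
# budget, so abundance of compute doesn't make the better method moot.
```

The numbers are purely illustrative; the shape of the argument is what matters: with a shared floor, the recipe with the steeper slope wins at every compute level.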
It might be something like literally compute clusters orbiting Earth.
solar panels.
The problem with that is heat dissipation. You get all the radiation from the sun, and you don't have any air to dissipate heat. But there is a lot of space to put clusters, there's a lot of solar energy there, and you could figure out the heat dissipation. There probably could be engineering will to solve the heat problem, so there could be.
Is it possible, and we should say that it definitely is possible, how likely it is is the question, that we're basically going to plateau this year? Not in terms of the system capabilities, but in what the system capabilities actually mean for human civilisation. So on the coding front, really nice websites will be built, very nice autocomplete, a very nice way to understand code
and maybe help debug, but really just a very nice helper on the coding front. It can help research mathematicians do some math. It can help you with shopping. It's a nice helper, it's Clippy on steroids. What else? It may be a good education tool and all that kind of stuff, but computer use turns out extremely difficult to solve. So I'm trying to frame the
cynical case in all these domains, where there's not a really huge economic impact, but we realise how costly it is to train these systems at every level, both the pre-training and the inference, how costly the inference is, the reasoning, all of that. Is that possible, and how likely is that, do you think?
If you look at the models, there are so many obvious things to improve, and it takes a long time to train these models and do this art, such that it'll take us, with the ideas we have, multiple years to actually saturate in terms of whatever benchmark or performance we are searching for. It might serve very narrow niches: the average chat G_P_T_ eight-hundred-million user might not get a lot of benefit out of this, but it is going to serve different populations by getting better at different things.
But I think what everybody's chasing now is a general system that's useful to everybody. So okay, if that's not achievable, that can plateau, right?
I think that dream is actually kind of dying, as you talked about with the specialized models. And multi-modal is often totally different; video generation is a totally different thing.
"That dream is kind of dying" is a big statement. 'Cause I don't know if it's dying. If you ask the actual frontier lab people, I mean, they're still chasing it, right?
I do think they are still rushing to get the next model out, which will be much better, not just in a relative sense but better than the previous one, and I can't see them slowing down. I just think the gains will be made, or felt, more through not only scaling the model. I feel like there's a lot of tech that is like, well, let's just put the better model in there, and a better model and a better model, and now people are saying, okay, let's also, at the same time, improve
everything around it, like the engineering of the context and inference scaling. The big labs will still keep doing that, and now the smaller labs will catch up too, because they are hiring more, there will be more people, and L_L_M_s, it's kind of like a circle, also make them more productive. I think what we can expect is amplification, but not a paradigm change, I don't think that is true; everything will just be amplified and amplified.
I could see that continuing for a long time, you know.
Yeah, I guess my statement that the dream is dying depends on exactly what you think it's gonna be doing. Claude Code is a general model that can do a lot of things, but it depends a lot on integrations and other things. I bet Claude Code could do a fairly good job of doing your email, and the hardest part is figuring out how to give the information to it and how to get it to be able to send your emails and stuff like this. But I think it goes back to the one-model-to-rule-them-all
ethos, which is a thing in the cloud that handles your entire digital life and is way smarter than everybody. It's operating in a
It's an interesting leap of faith to go from Claude Code to becoming that, which, in some ways,
there are some avenues for, but I do think the rhetoric of the industry is a little bit different.
I think the immediate thing we will feel next, as normal people using L_L_M_s, will probably be related to something also trivial, like making figures. Right now, L_L_M_s are terrible at making figures. Is it because we are getting served the cheap models, with less inference compute than behind the scenes? Maybe there are some cranks we can turn to already get better figures. But if you ask today, draw me a flow chart of X_Y_Z_, it's most of the time terrible. And it is kind of
a very simple task for a human. I think it's almost easier sometimes to draw something than to write something.
Yeah, multi-modal understanding does feel like something where it's odd that it's not better solved.
I think we're not saying one actually obvious thing, that we're not actually realising, that's a gigantic thing that's hard to measure, which is making all of human knowledge accessible to the entire world. One of the things that I think is hard to articulate is that there's just a huge difference between Google search and an L_L_M_. I feel like I can basically ask an L_L_M_ anything and get an answer. And it
is doing less and less hallucination. And that means understanding my own life, figuring out a career trajectory, figuring out how to solve the problems all around me, learning about anything through human history. I feel like nobody's really talking about that, because they just immediately take it for granted. This is awesome. That's why everybody's using it: 'cause you get answers for stuff.
The impact of that across time. Like, think about it: this is not just in the United States, it's all across the world, kids throughout the world being able to learn these ideas. The impact that has across time is probably, talk about G_D_P_, it won't be a leap, it'll be: that's how we get to Mars, that's how we build these things, that's how we have a million new open A_I_s, all the kind of innovation that happens from there. And that's
just this quiet force that permeates everything, right. Human knowledge.
I do agree with you, in a sense: it makes knowledge more accessible. But it also, I think, depends on what the topic is. For something like math, in a sense you can ask it questions and it answers, but if you wanna learn a topic from scratch, like we talked about earlier, I think the sweet spot is, I mean, there are really good math textbooks where someone laid it out linearly, and that is a proven strategy to learn this
topic. And it does make sense, if you start from zero, to ramp up with an information-dense text and soak it up. But then you use the L_L_M_ to make infinite exercises. You have problems in a certain area, or questions where something's uncertain, or you are uncertain about certain things. You ask it to generate example problems, you solve them, and you have questions, and then maybe you need more background knowledge and you ask it to generate that. But then the L_L_M_
won't give you anything, let's say, that is not in the textbook; it's just packaging it differently, if that makes sense. But then there are things where I feel it also adds value in a more, I mean, timely sense, where there is no good alternative besides a human doing it on the fly. For example, I dunno, let's say you're planning to go to Disneyland and you try to figure out which tickets to buy for which park when. Well, there is no textbook on that, there is no information-dense
resource, there is only the sparse internet. And then there is a lot of value in the L_L_M_. You just ask it, you have the constraints: I'm travelling these and these days, I want to go there and there, please figure out what I need, when, and from where, and what it costs, and stuff like that. And it is a very customised, on-the-fly package, and this is like one of a thousand examples. Personalisation is essentially pulling information from the sparse internet, the non-information-
dense thing, where there is no better version that exists, it just doesn't exist, you make it from scratch almost.
And if it does exist, it's full of, speaking of Disney World, what would you call it, ad slop? It's impossible to get through. Here, take any city in the world: what are the top ten things to do? An L_L_M_ is just way better to ask than anything on the internet.

That's 'cause they're massively subsidised, and they're gonna be paid for by ads. It's coming.
Maybe it comes up first, maybe not. I think there are clear laws around this, you have to be clear about it. But I think that's what everyone fears, the subtle message in there or something like that. It also brings us to the topic of ads, which I think Open A_I_ may try to launch in twenty twenty five, just because it's still not making money in that other way right now, so, having real ad spots in there. The thing, though, is they couldn't, because there are alternatives without ads and people would just flock to the other products. And it's also just crazy how they're one-upping each other, spending so much money just to get the users.
So, take Instagram ads. I don't use Instagram, but I understand the appeal of paying a platform to find users who will genuinely like your product, and that is the best case of things like Instagram ads. But there are also plenty of cases where advertising is very awful for incentives. I think a world where the power of A_I_ can integrate with that positive view, like, I am a person with a small business, I want to make the best, I dunno, damn steak knives in the world, and I want to sell them to somebody who needs them, if A_I_ can make that sort of advertising work even better, that's very good for the world, especially for digital infrastructure, because that's how the modern web has been built. But that's not to say addicting feeds, so that you can show people more content, are a good thing. I think that's even what Open A_I_ would say: they want to find a way to get the monetization upside of ads while still giving their users agency. And I personally would think that Google is probably gonna be better at figuring out how to do this, 'cause they already have ad supply. If they figure out how to turn the demand in their Gemini app into useful ads, they can just turn it on. Somebody will figure it out. I don't know if it's this year, but there will be experiments with it.
I do think what holds companies back right now is really just that the competition is not doing it. It's more of a reputation thing. I think people are just afraid right now of ruining their reputation and losing users, because it would make headlines if someone launched these ads.
Mm-hmm.
Something like that, where it will say, like, promoted, something small, and then there will be an image or something. I think right now the problem is who makes the first move.
If we go ten years out, the proposition for ads is that you will make so much money on ads, by having so many users, that you can use it to fund better R_ and D_ and make better models. That's why YouTube is dominating the market, like, Netflix is scared of YouTube. They have the ads, and I pay twenty eight dollars a month for Premium, so they make at least twenty eight dollars a month off of me and of many other people, and they're creating such a dominant position in video. So I think that's the proposition: ads can give you a sustained advantage in what you're spending per user. But there's so much money flowing right now that starting that flywheel is scary, 'cause it's a long-term bet.
Do you think there'll be some crazy big moves this year, business-wise? Like Google or Apple acquiring Anthropic or something like that?
Dario will never sell, but we are starting to see some types of consolidation, with Groq for twenty billion dollars and Scale A_I_ for almost thirty billion, and countless other deals like this. They're structured in a way that is actually detrimental to the Silicon Valley ecosystem, this sort of licensing deal where not everybody gets brought along, rather than a full acquisition that benefits the rank-and-file employee by getting their stock vested. That's a big issue for Silicon Valley culture to address, because the start-up ecosystem is the lifeblood: if you join a start-up, even if it's not that successful, your start-up very well might get acquired at a cheap premium and you'll get paid out for your equity. These licensing deals essentially take just the top talent a lot of the time. I think the Groq deal with NVIDIA is rumored to be better for the employees, but it's still this antitrust-avoiding thing, and I think this trend of consolidation will continue. Me and many smart people I respect have been expecting consolidation to happen sooner, but it seems like some of these things are starting to turn. At the same time, you have companies raising ridiculous amounts of money for reasons I don't like, where I'm like, I don't know why you're taking that money. So it's maybe mixed this year, but some consolidation pressure is starting.
What kind of surprising consolidation do you think we'll see? So you're saying Anthropic is a never. I mean, Groq is a big one, Groq with a Q, by the way.
Yeah. There's just a lot of start-ups, and there's a very high premium on A_I_ start-ups. So there could be a lot of ten-billion-range acquisitions, which is a really big acquisition for a start-up that was maybe founded a year ago. I think Manus A_I_, this company that's based in Singapore, was founded eight months ago and then had a two-billion-dollar exit. And I think there'll be some other big, many-billion-dollar acquisitions. Like Perplexity, yeah, people rumoured them to Apple. I think there's a lot of pressure and liquidity in A_I_. There's pressure on big companies to have outcomes, and I would guess that a big acquisition gives people leeway to then tell the next chapter of that story.
I mean, yeah, I guess Cursor. We've been talking about code, and somebody acquires Cursor.
They're in such a good position by having so much user data, and we talked about continual learning and stuff. They had one of the most interesting two sentences in a blog post, which is that their new Composer model was a fine-tune of one of these large mixture-of-experts models from China. You can know that through gossip, or because the model sometimes responds in Chinese, which none of the American models do. And they had a blog post where they said they're updating the model weights every ninety minutes based on real-world feedback from people using it, which is the closest thing to real-world R_L_ happening on a model. And it's just in one of their blog posts, which is super cool.
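As a rough mental model of that deployment-feedback loop, here is a minimal sketch. Everything in it is a hypothetical stand-in: Cursor has not published its training setup, so the "model" is a tiny linear scorer, the reward signal is simulated, and each loop iteration plays the role of one ninety-minute update window.

```python
import numpy as np

# Toy stand-in for "update the model weights every ninety minutes
# based on real-world feedback". The real system fine-tunes a large
# mixture-of-experts transformer; here the "model" is a 4-weight
# linear scorer over feature vectors.
rng = np.random.default_rng(0)
weights = np.zeros(4)

def collect_feedback(batch_size=64):
    """Simulate deployment traffic: feature vectors for served
    completions plus a binary reward (e.g. did the user keep the edit)."""
    feats = rng.normal(size=(batch_size, 4))
    rewards = (feats[:, 0] > 0).astype(float)  # users "like" feature 0
    return feats, rewards

def update(w, feats, rewards, lr=0.1):
    """Reward-weighted update (a crude REINFORCE-style step):
    nudge the scorer toward completions that users accepted."""
    baseline = rewards.mean()  # simple baseline for variance reduction
    grad = ((rewards - baseline)[:, None] * feats).mean(axis=0)
    return w + lr * grad

# Each iteration = one "ninety-minute window": collect live feedback,
# push a new checkpoint.
for _ in range(200):
    feats, rewards = collect_feedback()
    weights = update(weights, feats, rewards)

print(weights[0] > abs(weights[1]))  # → True: the rewarded direction dominates
```

The design point the sketch illustrates is the cadence, not the algorithm: the training signal comes from live usage rather than a static dataset, so the weights drift toward whatever users actually reward.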
And by the way, I use Composer a lot, because one of the benefits it has is it's fast.
I need to try it 'cause everybody says this.
And there'll be some I_P_O_s, potentially. You think Anthropic, Open A_I_, X_A_I_?
They can all raise so much money so easily that they don't feel a need to. So long as fundraising is easy, they're not gonna I_P_O_, because public markets apply pressure. We're seeing in China that the ecosystem's a little different, with both MiniMax and Z.ai filing I_P_O_ paperwork, and it'll be interesting to see how the Chinese market reacts. I actually would guess that it's gonna be similarly hypey to the U_S_, so long as all this keeps going, and not based on the reality that they're both losing a ton of money. I wish more of America's gigantic A_I_ start-ups were public, because it would be very interesting to see how they're spending their money, to have more insight, and also just to give people access to investing in them, because I think they're some of the formative companies of the era. And the tradition now is for so many of the big start-ups in the U_S_ to not go public. We're still waiting for the Stripe I_P_O_, and Databricks definitely didn't, they raised like a Series G_ or something. It's kind of a weird equilibrium for the market. I would like to see these companies go public and evolve in the way that a public company can.
You think ten years from now some of the frontier model companies are still around? Anthropic, Open A_I_?
I definitely don't see it being winner-takes-all, unless there truly is some algorithmic secret that one of them finds, some flywheel, because the development path is so similar for all of them. Google and Open A_I_ have all the same products, and Anthropic's more focused, but when you talk to people it sounds like they're solving a lot of the same problems. And the offerings will spread out. It's a very big cake that's being made, and a lot of people are gonna take money out of it.
I don't wanna trivialise it, but Open A_I_ and Anthropic are primarily L_L_M_ service providers, while some of the other companies, like Google and X_A_I_ linked to X_, do other stuff too. So it's very possible, if A_I_ becomes more commodified, that the companies that are just providing L_L_M_s will die.
I think they will survive. The advantage they have is a lot of users, and I think they will just pivot. Anthropic, I think, pivoted: I don't think they originally planned to work on code, but it happened that they found, okay, this is a nice niche, and now we are comfortable in this niche and we push on it. I can see the same thing happening again. Let's say, hypothetically speaking, I'm not sure it will be true, but let's say Google takes all the market share of the general chatbot. Maybe Open A_I_ will then focus on some other sub-topic. They have too many users to go away in the foreseeable future, I think.
I think Google is always ready to say, whoa, search might be over, move everyone to A_I_ Mode.
I think the question is whether the companies can support the valuations. I could see the A_I_ companies being looked at in some ways like A_W_S_, Azure, and G_C_P_, which are all competing in the same space and are all very successful businesses. There's a chance the A_P_I_ market is so unprofitable that they go up and down the stack, into products and hardware. They have so much cash that they can build power plants and build data centres, which is a durable advantage now. But there's also a reasonable outcome where these A_P_I_s are so cheap and so flexible for developers that they become something like A_W_S_. But A_W_S_ and Azure are also gonna have these A_P_I_s. Five or six players competing in the A_P_I_ market is hard, so maybe that's how they get squeezed out.
You mentioned R_I_P_ Llama. Is there a path to winning for Meta?
I think nobody knows, they're moving a lot. They're signing licensing deals with Black Forest Labs, the image generation company, or Midjourney, or Manus. So on the product and consumer-facing A_I_ front, it's too early to tell. I think they have some people that are excellent and very motivated being close to Zuckerberg, so there's still a story to unfold there. Llama is a bit different. Llama was the most focused expression of the organisation, and I don't see Llama being supported to that extent. It was a very successful brand for them, so they still might do some participation in the open ecosystem, or continue the Llama brand on a different surface. Do people know what Llama is?
Do you think there's a Llama five?
Not an open-weight one.
It's interesting. Just to recap a bit: Llama was, I would say, the pioneering open-weight model, and Llama one, two, three got a lot of love. Then, just hypothesizing or speculating, I think the leaders at Meta, the upper executives, got really excited about Llama because they saw how popular it was in the community. And I think the problem was trying to, let's say, not monetize the open source, but use open source to make a bigger splash, to force it almost. It felt forced, developing these very big Llama four models to be on top of the benchmarks. But I don't think the goal of the Llama models was to be on top of the benchmarks, beating, let's say, ChatG_P_T_ or other models. I think the goal was to have a model that people can use, trust, modify, and understand, and that includes having smaller models. They don't have to be the best models. What happened instead was that the benchmarks suggested the models were better than they were, because I think they had specific models trained on preferences so that they performed well on the benchmarks. It's kind of an overfitting thing, forcing it to be the best. And at the same time, they didn't do the small models that people could use; no one could run these big models. So it was a weird thing, and I think it's just because people got too excited about headlines. Pushing the frontier, I think, is a good thing.
Yeah, I think it imploded under internal political fighting and misaligned incentives. The researchers want to build the best models, but there's a layer of organisation and management trying to demonstrate that they do these things. And there are lots of pieces and rumours about how some horrible technical decision was made and how that came in. It just seems like it got so bad that it all crashed out. But we should
Mm-hmm.
also give huge props to Mark Zuckerberg. I think it comes from Mark actually, from the top of the leadership, saying open source is important. The fact that that exists means there could be a Llama five, where they learn the lessons from the benchmaxing and say, we're gonna be like G_P_T_O_S_S_ and provide a really awesome library of open source.
What people say is that there's a debate between Mark and Alexandr Wang, who is very bright, but much more against open source, and to the extent that he has a lot of influence over the A_I_ org, it seems much less likely. It seems like Mark brought him in as fresh leadership to aid in directing A_I_, and if open or closed is no longer the defining nature of the model, I don't expect that to be a defining argument between Mark and Alex. They're both very bright. But I have a hard time understanding all of it, because Mark wrote this piece in, maybe, July of twenty twenty four, which was probably the best blog post at the time, making the case for open source A_I_, and then July twenty twenty five came around and it was, we're re-evaluating our relationship with open source. So it's just kinda like that.

I think also, not the problem, but we may have been a bit too harsh, and that caused some of it. We as open source developers, the open source community: even though the model was maybe not what everyone hoped for, it got a lot of backlash, and I think that was a bit unfortunate. I can see that as a company they were hoping for positive headlines, and instead of getting no headlines, or those positive headlines, they got negative headlines, and it reflected badly on the company. So maybe it's a spite reaction: we tried to do something nice, we tried to give you something cool, an open source model, and now you're being negative about us, so maybe we'll change our mind. I guess, I dunno.
Yeah, that's where the dynamics of discourse on X_ can lead us as a community astray, because sometimes it feels random which thing people decide they don't like. Maybe we'll see the same thing with Grok four point one and Grok Code Fast one. I don't think, vibe-wise, people love them publicly, but a lot of people use them. If you look at Reddit and X_, the programming community doesn't really give it praise, but they use it. And probably the same thing with Llama. I don't understand the dynamics of either positive hype or negative hype.
I mean, one of the stories of twenty twenty five is the U_S_ feeling the gap left by Llama, with the rise of these Chinese open-weight models, to the point where that was the single issue I spent a lot of energy on in the last five months, trying to do policy work to get the U_S_ to invest in this.
So tell me the story of ATOM.
The ATOM Project started as me calling it the American DeepSeek Project, which doesn't really work for D_C_ audiences. But it's the story of what is the most impactful thing I could do with my career: these Chinese open-weight models are cultivating a lot of power, and there is a lot of demand for building on open models, especially in enterprises in the U_S_ that are very cagey about the Chinese models.
According to Perplexity: the ATOM Project, American Truly Open Models, is a U_S_-based initiative to build and host high-quality, genuinely open-weight A_I_ models and supporting infrastructure, explicitly aimed at competing with and catching up to China's rapidly advancing open-source A_I_ ecosystem.
I think the one-sentence summary would be that, or two sentences. One is the proposition that open models are going to be an engine for A_I_ research, because that is what people start with, and therefore it's important to own them. The second is that the U_S_ should therefore be building the best open models, so that the best research happens in the U_S_, and U_S_ companies capture the value of being the home of where A_I_ research is happening. Without more investment in open models, we have all the plots on the website where it's Qwen, Qwen, Qwen, Qwen, all these excellent models from Chinese companies that are cultivating influence in the U_S_, in China, and internationally. And the U_S_ is spending way more on A_I_. The ability to create open models that are half a generation or a generation behind the cutting edge of a closed lab costs on the order of a hundred million dollars, which is a lot of money, but not a lot of money to these companies. So we need a centralising force for the people who want to do this, and I think we got signs of engagement from people pretty much across the full stack, including policy.
So there has been support from the administration?
I don't think anyone technically in government has signed it publicly, but I know that people who have worked on A_I_ policy, in both the Biden and Trump administrations, are very supportive of trying to promote open-source models in the U_S_. For example, Ai2 got a grant from the N_S_F_ for a hundred million dollars over four years, which is the biggest C_S_ grant the N_S_F_ has ever awarded, and it's for Ai2 to attempt this. I think it's a starting point. But the best thing happens when there are multiple organisations building models, because they can cross-pollinate ideas and build an ecosystem. It doesn't work if it's just Llama releasing models to the world, because then Llama can go away. The same thing applies to Ai2: I can't be the only one building models. So it becomes a lot of time spent talking to people, whether they're in policy or elsewhere. I know NVIDIA is very excited about this; I think Jensen Huang has been specifically talking about the urgency of this. And they've done a lot more in twenty twenty five: the Nemotron models are more of a focus, they've started releasing some data along with NVIDIA's open models, and very few companies do this, especially of NVIDIA's size. So there are signs of progress. And we hear about Reflection A_I_, where they say their two-billion-dollar fundraise is dedicated to building U_S_ open models, and their announcement tweet reads like a blog post. So I think that cultural tide is starting to turn. In July we had something like four or five DeepSeek-caliber Chinese open-weight models and zero from the U_S_. That's the moment where I was like, oh, I guess I have to spend energy on this, because nobody else is gonna do it. It takes a lot of people contributing together. I don't say that the ATOM Project is the thing that's moving the ecosystem, but it's people like me doing this sort of thing to get the word out.
Do you like the twenty twenty five America's A_I_ Action Plan? That includes open source stuff. The White House A_I_ Action Plan includes a dedicated section titled Encourage Open-Source and Open-Weight A_I_, defining such models and arguing they have unique value for innovation and startups.
Yeah. The A_I_ Action Plan is a plan, but it's maybe the most coherent policy document that has come out of the administration, and I hope it largely succeeds. I know people who have worked on the A_I_ Action Plan, and the challenge is taking policy and making it real, and I have no idea how to do that as an A_I_ researcher. But a lot of the things in it were very real. There's a huge build-out of A_I_ in the country, and there are a lot of issues people are hearing about, from water use to whatever. We should be able to build things in this country, but we also need to not ruin places in our country in the process of building, and that's worthwhile to spend energy on. That's a role the federal government plays: they set the agenda. With A_I_, setting the agenda that open weights should be a first consideration is a large part of what they can do, and then people think about it.
Also for education, and for talent for these companies, I think it's very important, because otherwise, if there are only closed models, how do you get the next generation of people contributing? At some point you'd only be able to learn after you join a company, but at that point, how do you hire talented people, how do you identify talented people? Open source matters for a lot of things, but even just for educating the population and training the next generation of researchers, it's the way, or the only way.

The way I could have gotten this to go more viral was to tell a story of Chinese A_I_ integrating with an authoritarian state and becoming A_S_I_ and taking over the world, and therefore we need our own American models. But it's very intentional that I talk about innovation and science in the U_S_, because I think it's both more realistic as an outcome and a world I would like to manifest.

I would say, though, that any open-weight model, I do think, is a valuable model.
Yeah. And my argument is that we should be in a leading position. But it's worth saying it so simply, because there are still voices in the A_I_ ecosystem that say we should consider banning the release of open models due to the safety risks. And it's worth adding that I think that's effectively impossible without the U_S_ having its own great firewall, which is also known to not work that well, because the cost of training these models, whether it's one to a hundred million dollars, is attainable to a huge number of people in the world who want to have influence. So these models will be getting trained all over the world. There are safety concerns, but we want this information and these tools to flow freely across the world and into the U_S_, so that people can use them and learn from them. Stopping that would be such a restructuring of our internet that it seems impossible.
Do you think, in that case, the big open-weight models from China are actually a good thing, in a sense, for the U_S_ companies? Because, as you mentioned earlier, the U_S_ companies are usually one generation behind in terms of what they release open source versus what they are using. For example, G_P_T_O_S_S_ might not be the cutting-edge model, Gemma three might not be, but they do that because they know it's safe to release. Then when these companies see, for example, DeepSeek version three point two, which is really awesome, get released, and there is no backlash, there is no security risk, that could encourage them to release better models. Maybe that, in a sense, is a very positive thing.
A hundred percent. These Chinese companies have set things into motion that I think would potentially not have happened if they were not all releasing models. I'm almost sure those discussions have been had by leadership.
Is there a possible future where the dominant A_I_ models in the world are all open source?
It depends on the trajectory of progress you predict. If you think saturation in progress is coming within a few years, essentially within the time where financial support is still very good, then open models will be so optimised and so much cheaper to run that they will win out. This goes back to open source ideas: so many more people will be putting money into optimising the serving of these open-weight, common architectures that they will become standards, and then you could have chips dedicated to them, and it'll be way cheaper than the custom offerings from the closed companies.
We should say that the A_I_ twenty twenty seven report kind of predicts, one of the things it does from a narrative perspective, is that there'll be a lot of centralization. As the A_I_ systems get smarter and smarter, national security concerns come to the fore, the labs get centralized and become super secretive, and there's this whole race from a military perspective between China and the United States. And so, in the middle of all these fun conversations we're having about L_L_M_s, the soldiers will come into the room and be like, alright, we're now in the Manhattan Project stage of this whole thing.
In twenty twenty five, twenty six, twenty seven, I don't think something like that is even remotely possible. I mean, you can make the same argument for computers, right? You can say, okay, computers are capable and we don't want the general public to get them, or chips, even A_I_ chips. But you see how Huawei makes chips now. It took a few years, but I don't think there is a way you can contain knowledge like that. In this day and age it's impossible, like the internet. I don't think this is a possibility.
On the Manhattan Project thing, one of my funny takes on it is that a Manhattan Project-like thing for open models would actually be pretty reasonable, because it wouldn't cost that much. I think that will come, and it seems like culturally the companies are changing. But I agree with Sebastian on all the stuff you just said. I don't see it happening, nor being helpful.
Yeah, I mean, the motivating force behind the Manhattan Project was a civilizational risk. It's harder to motivate that for open-source models.
There's no civilizational risk.
On the hardware side, we mentioned NVIDIA a bunch of times. Do you think Jensen and NVIDIA are gonna keep winning?
I think they have the downside that they have to iterate a lot and manufacture a lot, and they do innovate, but there's always the chance that somebody does something fundamentally different, gets very lucky, and pulls it off. The problem, though, is adoption. The moat of NVIDIA is probably not just the G_P_U_, it's the CUDA ecosystem, and that has evolved over, I think, two decades. Even back when I was a grad student, in a lab doing biophysical simulations, molecular dynamics, we had a Tesla G_P_U_ just for the computation, and that was, what, fifteen years ago now. They built this up over a long time, and that's the moat, I think, not the chip itself, although they now have the money to iterate and build and scale. It's really the compatibility: if you're at that scale as a company, why would you go with something risky where only a few chips can be made per year? You go with the big one. But I do think, with L_L_M_s now, it will be easier to design something like CUDA. It took fifteen years because it's hard, but now we have L_L_M_s, and maybe we can replicate CUDA.
What about the separation of training and inference compute, as things stabilise and more and more compute is needed for inference?
That's supposed to be the point of the Groq acquisition. And that's part of what Vera Rubin is, where they have a new chip with no high-bandwidth memory, or very little, which is one of the most expensive pieces. It's designed for prefill, which is the part of inference where you essentially do a lot of matrix multiplications, and then you only need the memory when you're doing the autoregressive generation and you have the K_V_ cache swaps. So they have this new G_P_U_ that's designed for that specific use case, and then the cost of ownership
per flop or whatever is actually way lower. But I think that Nvidia's fate still lies in the diffusion of A_I_. Their biggest clients are still these hyperscale companies: Google obviously can make T_P_U_s, Amazon is making Trainium, Microsoft will try to do its own things. And
so long as the pace of A_I_ progress is high, Nvidia's platform is the most flexible and people will want that. But if there's stagnation, then there's more time to create bespoke chips.
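The prefill/decode split mentioned above can be made concrete with a toy NumPy sketch (sizes and names are illustrative, not any vendor's actual design): prefill runs attention over the whole prompt in one compute-bound matmul, while each decode step appends a single row to the K_V_ cache and re-reads the whole cache, which is why decode is memory-bandwidth-bound and prefill is not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8           # head dimension (toy size)
prompt_len = 5  # tokens processed in the prefill phase

def attention(q, K, V):
    # Scaled dot-product attention: queries q against cached keys/values K, V.
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# Prefill: the whole prompt is processed in one big matrix multiplication,
# so arithmetic dominates and high-bandwidth memory matters less.
X = rng.normal(size=(prompt_len, d))
K_cache, V_cache = X.copy(), X.copy()  # stand-ins for projected keys/values
prefill_out = attention(X, K_cache, V_cache)

# Decode: each generated token appends one K/V row and reads the entire
# cache back, so the step is dominated by memory traffic, not arithmetic.
x_new = rng.normal(size=(1, d))
K_cache = np.vstack([K_cache, x_new])
V_cache = np.vstack([V_cache, x_new])
decode_out = attention(x_new, K_cache, V_cache)

print(prefill_out.shape, decode_out.shape)  # (5, 8) (1, 8)
```

In a real serving stack the cache holds separately projected keys and values per layer and per head; the sketch collapses all of that to show only the shape of the two phases.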
It's interesting that Nvidia is quite active in trying to develop all kinds of different products.
They tried to create areas of commercial value that will use a lot of G_P_U_s.
Mm-hmm. But they keep innovating, and they're doing a lot of incredible research, so.
Everyone says that the company is super oriented around Jensen and how
operationally plugged in he is, and it sounds so unlike many other big companies that I've heard about. So long as that's the culture, I expect them to keep progress happening. He's still in the Steve Jobs era of Apple. So long as that is how it operates, I'm pretty optimistic for their situation, because it is their top-order problem. And I don't know if making these chips for the whole ecosystem is the top goal of all these other companies. They'll do a good job, but it
might not be as good a job.
Since you mentioned Jensen, I've been reading a lot about history and about singular figures in history. What do you guys think about the single man or woman view of history? How important are individuals for steering the direction of history in the tech sector? What's Nvidia without Jensen? You mentioned Steve Jobs, so what's Apple without Steve Jobs? What's X_A_I_ without Elon?
Or Deep Mind without Demis.
People make things happen earlier and faster. Scientifically, many great scientists credit being in the right place at the right time and still making the innovation, where eventually someone else would have had the idea. So I think that in that way Jensen is helping manifest this G_P_U_ revolution much faster, and with much more focus, than it would have happened without a person there. And this is making the whole A_I_ build-out faster. But I do still
think that eventually something like chat G_P_T_ would have happened, and a build-out like this would have happened, but it probably would not have been as fast. I think that's the flavour I'd apply.
These individuals are people who are placing bets on something. Some get lucky, some don't. But if you don't have these people at the helm, it will be more diffuse. It's almost like investing in an E_T_F_ versus individual stocks. Individual stocks might go up or down more heavily than an E_T_F_, which is more balanced and will eventually go up over time, and we'll get there. But it's focus, I think. Passionate focus.
Isn't there a real case to be made that without Jensen there's not a reinvigoration of the deep learning revolution?
It could have been twenty years later, is the thing that I would say. Or another A_I_ winter, like a deep learning winter, could have come if G_P_U_s weren't around.
That would change history completely, 'cause you could think of all the other technologies that could have come in the meantime, and the focus of human civilisation, of Silicon Valley, could have been captured by a different hype.
But I do think, I mean, there's certainly an aspect where the G_P_U_ trajectory was planned, but on the other hand there were also a lot of lucky coincidences. For example, the investment into, let's say, biophysical simulations. I mean, I think it started with video games, and the chips just happened to be good at linear algebra, because video games require a lot of linear algebra, and then you have the biophysical simulations. But still, I don't think the master plan was A_I_. It just
happened to be Alex Krizhevsky. Someone took these G_P_U_s and said, hey, let's try to train a neural network on that, and it happened to work really well. And I think it only happened because you could purchase those G_P_U_s.
Mm-hmm.
That's what I would think. I think the G_P_U_s would have been different for AlexNet, but G_P_U_s would still exist at the time of AlexNet and at the time of the transformer. It was just hard to know whether it would be one company this successful or multiple smaller companies with worse chips. But I don't think that's a hundred-year delay. It might be a decade delay.
I mean, I just can't see Intel or AMD doing what NVIDIA did with CUDA.
Mm-hmm. Like Silicon Graphics or something.
But just looking at it, it seems like these singular figures, these leaders, have a huge impact on the trajectory of the world. Obviously there are incredible teams behind them, but having that kind of very singular, almost dogmatic focus is necessary to make progress.
Yeah, I mean, even G_P_T_ wouldn't exist if there wasn't a person, Ilya, who pushed for this scaling, right? I mean, Dario was also deeply involved in that. It almost seems wild thinking about how early these people were, saying we need to hook up ten thousand G_P_U_s, take all of Open A_I_'s compute, and train one model. There were a lot of people there that didn't wanna do that.
Again, singular figures. Speaking of which: a hundred years from now, this is presumably post-singularity, whatever the singularity is, when historians look back at our time, what technological breakthroughs will they really emphasise as the breakthroughs that led to the singularity? So far we have Turing to today, eighty years.
I think it would still be computing, the umbrella term computing. I don't necessarily think that even a hundred or two hundred years from now it would be A_I_; it could still well be computers, you know. We are now just taking better advantage of computers, but the fact of computing.
It's basically a Moore's-law kind of discussion. Even the details of CUDA and G_P_U_s won't be remembered, and it won't be all the software turmoil. It'll just be, obviously, compute.
I would generally agree, but can the connectivity of the internet and compute be merged, or is it both of them?
I think the internet will probably be grouped under, yeah, communication. It could be the phone, the internet, satellites, that stuff. And compute is more the scaling aspect of it.
It's possible that the internet is completely forgotten, that the internet gets wrapped into the phone networks, the communication networks, as just another manifestation of those. And the real breakthrough comes from the increased compute, Moore's law broadly defined.
Well, I think that connection of people is very fundamental to it. You can talk to anyone; if you wanna find the best person in the world for something, they are somewhere in the world. And being able to have that flow of information, the A_I_s will also rely on this. I've been fixating on, when I said the dream was dead about the one central model, the thing that is evolving instead: people have many agents for different tasks. People are already starting to do this with different Claudes for different tasks, and it's described as
many A_G_I_s in the data center, where each one manages tasks and they talk to each other. That is so reliant on networking and the free flow of information on top of compute. But networking, especially with G_P_U_s, is such a part of scaling up compute. The G_P_U_s in the data centers need to talk to each other.
Do you think anything about neural networks will be remembered? Do you think there's something very specific and singular to the fact that it's neural networks? That seems like a stroke of genius, that you're basically replicating, in a very crude way, the human mind, the structure of the human brain.
I think without the human mind we probably wouldn't have neural networks, because it was the inspiration for them. But on the other end, I think they're just so different. I mean, it's digital versus biological, so I do think it will probably be grouped more as an algorithm.
That's massively parallelizable on this particular kind of compute.
It could well have been genetic computing, genetic algorithms, just as parallelised. I think it just happens that this is more efficient, works better, you know.
And it very well could be that the L_L_M_, you know, the neural networks the way we architect them now, is just a small component of the system that leads to the singularity.
If you think a hundred years out, I think society can be changed more with more compute and intelligence because of autonomy. But looking at the Industrial Revolution, what are the things we remember? The engine is probably the equivalent of the computer here, but there are a lot of other physical transformations that people are aware of, like the cotton gin, all these machines that are still known: air
conditioners, refrigerators. Some of these things from A_I_ will still be known. The word transformer could still very well be known. I would guess that deep learning is definitely still known, but the transformer might be evolved away from in a hundred years, with A_S_I_-level A_I_ researchers everywhere. But I think deep learning is likely to be a term that is remembered.
And I wonder what the air conditioning and the refrigeration of the future is that A_I_ brings. If we travel forward a hundred years, transport there right now, what do you think is different? How does the world look different? First of all, do you think there are humans? Do you think there are robots everywhere walking around?
I do think specialised robots for sure, for certain tasks. Maybe half of them humanoid, we'll see. I think for certain things, yes, there will be humanoid robots, because the environment is built for humans; for certain tasks it might make sense. What's harder to imagine is how we interact with devices, and what humans do with devices. I'm pretty sure it will probably not be the cell phone, probably not the laptop. Will it be, you know, implants?
I mean, it has to be a brain-computer interface, right? A hundred years from now, given the progress we're seeing now, there has to be, unless there's legitimately a
complete alteration of how we interact with reality.
On the other hand, if you think of cars, cars are older than a hundred years, right? And it's still the same interface. We haven't replaced cars with something else; we just made the cars better. But it's still a steering wheel, it's still wheels, you know.
I think we'll still carry around a physical brick of compute, because people want some ability to have something private. You might not engage with it as much as a phone, but having something where you can keep private information that is yours, as an interface between you and the rest of the internet, I think is something people will still want. It might not look like an iPhone, and it might be used a lot less, but I still expect people to carry things around.
Private for you, like encrypted messages, encrypted photos, you know, what your life is. I guess this is a question of how optimistic you are on brain-machine interfaces. Is all of that just gonna be stored in the cloud, your whole calendar? It's hard to think about brain-machine interfaces processing all the information that we can process visually, presenting something like a calendar to you. It's hard to even imagine
knowing your email inbox without looking, you signal to a computer and then you just know your inbox. Is that something the human brain can handle being piped into it non-visually? I don't know exactly how those transformations happen, 'cause humans aren't changing in a hundred years.
A local community, yeah,
like people you are close to, being able to do things with them, and being able to ascribe meaning to your life. If not in a hundred years, I don't think human biology is changing away from those on a time scale that we can discuss, and I think that U_B_I_ does not solve agency. I do expect mass wealth, and I hope that it is spread so that
average life does look very different in a hundred years. That's still a lot to happen in a hundred years, if you think about countries that are early in their development process of getting access to computing and the internet, building all the infrastructure, and having policy that shares one nation's wealth with another.
I think it's an optimistic view to see all of that happening in a hundred years while they are still independent entities, not just absorbed into some international order by force.
But there could be better, more elaborate, more effective social support systems that help alleviate some levels of basic suffering in the world. In the transformation of society where a lot of jobs are lost in the short term, I think we have to really remember that each individual job that's lost is a human being who's suffering. When jobs are lost at that scale, it's a real tragedy. You can make all kinds of arguments about
economics, that it's all going to be okay, it's good for the G_D_P_, there are going to be new jobs created, but fundamentally, at the individual level, for that human being, that's real suffering. That's a real personal tragedy, and we have to not forget that as these technologies are being developed.
And also, my hope, with all the A_I_ slop we're seeing, is that there'll be a greater and greater premium on the fundamental aspects of the human experience that are in person, the things that we all like: seeing each other, talking together in person.
The next few years are definitely gonna bring an increased value on physical goods and events, and even more pressure from slop. The slop is only starting. The next few years will bring more and more diverse versions of slop.
Mm-hmm.
on it.
Even, like, the classic examples. I honestly think this is true, and I think we'll get tired of it; we are already kind of tired of it. Same with art, I mean. I don't think art will go away. You have physical paintings; there's more value, not just monetary value, but more appreciation, for the actual painting than for a photocopy of that painting. It could be a perfect digital reprint, but there is something about going to a museum, looking at that art, seeing the real thing, and thinking, okay, a human made this.
It's a craft; you have an appreciation for that. And I think the same is true for writing, for talking, for any type of experience. Unfortunately, I do think it will be a dichotomy, like a fork, where some things will be automated. You know, there are not as many paintings as there used to be two hundred years ago; there are more photographs, more photocopies. But at the same time, it won't go away. There will be value in it. I think the difference will just
be the proportion. But personally, I have a hard time reading things where I can obviously see it's A_I_ generated. I'm like, I'm sorry, there might be really good information there, but I have a certain... nah, not for me, I think.
Eventually they'll fool you. And it'll be on platforms that give ways of verifying or building trust. So you will trust that Lex is not A_I_ generated, having been here, and then you have trust in this channel. But it's harder for new people that don't have that trust.
Mm-hmm.
This is real, this is not real. There will be some telltale signs where you can obviously tell this is A_I_ generated and this is not. But some will be so good that it's hard to tell, and then you have to trust, and that will get interesting and a bit problematic.
Mm-hmm. Mm-hmm.
Like human editing, which is the opposite of the discussion about trying to watermark A_I_ images; and then you can make a Google image that has a watermark and use a different Google tool to remove the watermark. Yeah, it's gonna be an arms race. Yeah.
I mean, all the capabilities that we've been talking about can also be used to destabilise human civilisation, even with relatively dumb A_I_ applied at scale, and then further with more and more superintelligent A_I_ systems. Of course there's the doomer take that's important to consider a little bit as we develop these technologies. What gives you hope about the future of human civilisation, given everything we've been talking about?
Are we going to be okay?
I think we will. I'm definitely a worrier, both about A_I_ and non-A_I_ things, but humans do tend to find a way. That's what humans are built for: to have community, find a way, and figure out problems, and that's what has gotten us to this point. And I think the A_I_ opportunity and related technologies are really big, and there are big social and political problems in
getting everybody to understand that. I think that's what we're staring at a lot right now: the world is a scary place and A_I_ is a very uncertain thing. And it takes a lot of work that is not necessarily building things; it's telling people and understanding people, which the people building A_I_ are historically not motivated to do or wanting to do. But it is probably doable, and it will just take longer than people want. And we have to go through that long period of
hard A_I_ discussions if we want to have the lasting benefits.
Yeah, and through that process I'm especially excited that we get a chance to better understand ourselves, both at the individual level as humans and at the civilisation level,
and answer some of the big mysteries, like: what is this whole consciousness thing going on here? It seems truly special; there's a real miracle in our mind, and A_I_ puts a mirror to ourselves, so we get to answer some of the big questions about what this whole thing going on here really is.
Well, one thing about that: what I do think makes us very different from A_I_, and why I don't worry about A_I_ taking over, is, like you said, consciousness. We humans decide what we want to do. With A_I_, in its current implementation, and I can't see that changing, you have to tell it what to do. So you still have the agency; it doesn't take the agency from you. It becomes a tool; you can think of it as a tool that you tell what to do. It will be more powerful
than other previous tools, certainly more powerful than a hammer, it can figure things out, but it's still you in charge, right? The A_I_ is not in charge, you are in charge; you tell the A_I_ what to do, and it does it for you.
So in the post-singularity, post-apocalyptic war between humans and machines, you're saying humans are worth fighting for.
A hundred percent. I mean, this is essentially the movie Terminator they made in the eighties, and the only thing I can see going wrong is, of course, if things are explicitly programmed to do the thing that is harmful, basically.
I think actually in that Terminator type of setup, humans win.
Mm-hmm.
I think we're too clever.
It's hard to explain how we figure it out, but we do, and we'll probably be using local L_L_M_s, open source L_L_M_s, to help fight the machines. I apologize for the ridiculousness. Like I said, Nathan already knows I've been a big fan of his for a long time, and I've been a big fan of yours, Sebastian, for a long time, so it's an honour to finally meet you. Thank you for everything you put out into the world, thank you for the excellent books you're writing, thank you for teaching us,
and thank you for talking today. This was fun.
Thank you for inviting us here and having this human connection, which is an extremely valuable one.
Thanks for listening to this conversation with Sebastian Raschka and Nathan Lambert. To support this podcast, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on. And now, let me leave you with some words from Albert Einstein.
It is not that I'm so smart, but that I stay with the questions much longer.
Thank you for listening and hope to see you next time.