Intro
Jessy Lin: What about pre-training or even post-training makes it possible for the models to generalize in these magical emergent ways, and controlling that process so that a company has a set of private data? How do we make the models learn that just as well as the models know the capital of France or how to write Python? So I think it’s a really fun problem to think about.
Sonya Huang: Welcome to Training Data. We are delighted to have Dan Biderman and Jessy Lin, co-founders of Engram, today. Engram is a NeoLab focused on memory and continual learning—two of the hottest topics in all of AI research today. And Shaun and I are delighted to dig in on those topics with you today.
Dan Biderman: Awesome. Happy to be here.
Main conversation
Sonya Huang: Great. So maybe to kick off, the Engram website says, “We don’t see the world through the lens of pre-training or post-training. Our models are always training.” What does that mean?
Jessy Lin: So I think, like, models today obviously know a lot of things. They’re incredibly smart. But we kind of think the bottleneck for making these models more useful these days is not really raw intelligence, but understanding new and evolving contexts. So whether it’s a new task that you’re doing or a particular context for a job or something like this, how do you bake that into the model weights the same way that pre-training and post-training bake that into the model weights very deeply? And this is kind of why we think of ourselves as working on these fundamental problems of memory and continual learning, which are really two sides of the same coin: How do you make the models learn new things and bake them deeply into the weights of the model?
Sonya Huang: And is your premise then that memory as a separate database or separate thing that you’ve shoved into the context window is not true memory and it’s not true continual learning?
Jessy Lin: I think all of these tools will kind of come together. So these days, the way that people are solving these problems is with context engineering. So you take a huge prompt, maybe you keep talking to the model over many, many turns and hours and reorganize the context to better understand what you’re trying to do. And we think these kinds of things like tool use, context engineering will play a part, but I think an underleveraged tool these days is using the same kind of training pipeline or framework or kind of workflow that the frontier labs are using to make these models really good at frontier math or code, but applying that to every kind of domain, every kind of context that you have, let’s say in a company.
Dan Biderman: Yeah. And to me, it’s like, as an individual, taking notes and having sticky notes is a very valuable thing. We should never discard this. But whenever we get back to business the next day, we always have some sort of trace of memory in our brain, some new intuition about how things should be and where should we look. So these two things should come together. And current solutions are more kind of externalized memory.
And this has two issues. One is that the amount of tokens we will all collectively, individually generate is going to be in the tens of millions of tokens per day soon. So just keeping it and searching through it and rereading it is going to be pretty expensive, but it’s going to also be pretty hard, pretty confusing for the models unless we have major, major breakthroughs in how we do it.
Sonya Huang: Tens of billions of tokens for Shaun. [laughs]
Dan Biderman: That’s good.
Shaun Maguire: Depends on the day.
Sonya Huang: Could you maybe tell us a little bit about the Engram architecture, the Engram product, and how it works?
Jessy Lin: Yeah. I mean, at a high level, I think what we’re trying to do is take any context, like, there’s all these different workspaces, let’s say. So we’re working with partners like Notion and Microsoft and Harvey that have these places where people are doing a lot of work over a long period of time. There’s all this context, both in terms of documents that you’ve already written as a team, as well as now people are interacting with these agents more and more in these products or having conversations, giving them feedback. And figuring out how to have a model that deeply understands that context. So not just reading the files at test time, but really understanding it the way that an employee that’s worked at your company for years has. So you kind of understand at a high level, oh, these are the initiatives across the company. This is the way that we do things. You’ve studied how to run the hiring pipeline or how to do this kind of thing within the company, and can operate just as well as anybody else can in the company.
And so what we’re doing is training per-team models within these workspaces that deeply understand those contexts, and can improve with time on the things that people care about. So the way that we do this at a technical level maybe is training these into weights. So we do a lot of adapter fine-tuning, so adapters of many types. I think people have looked into this for decades at this point, whether it’s LoRAs or prefixes or sparse architectures. I think all of these tools are at our disposal. And then figuring out what the right data is. So how do you turn any kind of raw document or interaction into useful training signal for the model? So again, we have a variety of tools now, like supervised fine-tuning, RL, on-policy distillation, all of these things that the field has developed, and trying to fit these pieces together into a model that learns continuously on the things that people care about.
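To make the adapter idea concrete, here is a minimal sketch of a LoRA-style layer in NumPy. It is only an illustration of the general technique Jessy names, not Engram's method: the function name `lora_forward`, the dimensions, the rank, and the `alpha` scaling value are all assumptions chosen for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Apply a frozen base weight W plus a low-rank update (alpha / rank) * B @ A.

    Only A and B would be trained; W stays fixed. All names here are illustrative.
    """
    rank = A.shape[0]
    scale = alpha / rank
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4                # hypothetical sizes

W = rng.normal(size=(d_out, d_in))           # frozen base weights
A = rng.normal(size=(rank, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, rank))                  # trainable up-projection, zero-initialized
x = rng.normal(size=(2, d_in))               # a batch of two inputs

base_params = W.size                         # parameters in the frozen layer
adapter_params = A.size + B.size             # parameters the adapter adds

y = lora_forward(x, W, A, B)
```

Because `B` starts at zero, the adapter is a no-op until training updates it, and in this toy configuration it adds 512 trainable parameters next to 4,096 frozen ones. That small footprint is one reason a separate adapter per team or workspace is plausible.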
Dan Biderman: Yeah. And it’s not a bet that tools are not there. Like, our models always work under the assumption that some knowledge is externalized, some tools are always there. But what you need to do is you need to figure out—and that’s the hard task—what needs to be internalized and what can be externalized. And even for stuff that’s externalized, many individuals and companies have their own bespoke tools and ways of doing things. Not everyone has the same Bash CLI tools that the frontier models are training on, and how to get the models to better understand your bespoke setup, I think, is its own interesting thing.
Sonya Huang: And so is the premise then that my Notion agent will be a custom agent that is LoRA fine-tuned or, you know, it’s some way with an adapter tuned so that it’s constantly learning on new content that’s added into my Notion workspace? Is that the premise?
Dan Biderman: Yeah. And they’re working with many models, and they’re the early users of all the frontier models and they’re probably going to keep doing that.
Sonya Huang: Does this approach work on the frontier models or the closed frontier models?
Dan Biderman: We need white box access to the weights, right? So we can partner with companies that have closed source weights and do this with them, but it’s easiest for us to do it with open source models. But any model that’s a transformer model, we can do our thing to it.
Sonya Huang: And what’s the trade-off then when people are comparing the before and after using you? Is it that they’re no longer sending so much context? And so the trade-off is like you burn more compute upfront to learn your company’s way of doing things into the weights, and then you’re sending less context to the model on every inference pass. Is that the rough trade-off?
Dan Biderman: Yeah, that’s one thing. The fact that you don’t have to re-search things and reread things, and the fact that you don’t have to write monstrous system prompts, that can give you two orders of magnitude reduction in token inference consumption. It’s not like 50 percent or—it can be 100x fewer tokens because many things, especially things that relate to people and teams and organization and priorities, these are things that you can’t really find in one document unless you have it really regimented and document everything. And these kinds of things the model can kind of implicitly learn by training on some of the data and answer within 100 tokens what the best frontier models would consume 100,000 tokens doing. So these kinds of examples are interesting.
And also the quality. You know, there are tasks that are not super natural for the current generation of the models. And we kind of think there’s going to be consistently this gap of, like, three to six months ahead where there’s certain things that are bespoke that people are just exploring; the models are not fully great for them. The models will at some point be great for them, but if you can autonomously learn in a very lightweight way, it will give value in that time in terms of capabilities.
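The trade-off discussed in this exchange (more compute upfront, fewer tokens per query) can be sketched as a break-even calculation. The 100-token and 100,000-token figures come from Dan's example above; the upfront training cost of one billion token-equivalents is a purely hypothetical number, and the function names are made up for illustration.

```python
import math

def token_reduction(tokens_before, tokens_after):
    """How many times fewer tokens a query consumes after training."""
    return tokens_before / tokens_after

def break_even_queries(upfront_cost_tokens, tokens_before, tokens_after):
    """Number of queries needed before per-query savings repay the upfront cost."""
    saved_per_query = tokens_before - tokens_after
    if saved_per_query <= 0:
        raise ValueError("training must reduce per-query token use")
    return math.ceil(upfront_cost_tokens / saved_per_query)

# Per-query figures from the conversation; upfront cost is an assumption.
reduction = token_reduction(100_000, 100)
queries = break_even_queries(1_000_000_000, 100_000, 100)
```

Under these assumed numbers the upfront cost is repaid after roughly ten thousand queries. The real answer depends on how training compute is priced relative to inference tokens, which the conversation does not specify.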
Sonya Huang: Why train on the workspace level versus the individual level, for example?
Dan Biderman: Either is fine for us. It’s just easier to start with—you know, teams of people are more disciplined in how they collect context and in the amount of context they have over years, and it’s easy for us to start there. But every person’s computer and every person’s phone one day is a useful target for our technologies. And in fact, it’ll be very interesting to go there. We just think the big deposits of information are now in teams of people collaborating in knowledge work.
Sonya Huang: Is it a feature or a bug that there is so much fact memorization basically built into large language models? And there’s a school of thought that the models just rote memorizing the fact that the capital of France is Paris is actually a bad thing. And what we would prefer for the models to do is abstractly learn the concepts of countries and capital cities, but not to memorize all these facts in the weights. And so I’m curious what you think about disentangling memorization versus learning, how it’s done in the models today, and then how you’re thinking of approaching it.
Jessy Lin: Yeah, I think it’s a really interesting question. To some extent, you kind of need to remember stuff in order to compose them into more complex concepts. I think the thing that’s kind of missing is figuring out what’s important to remember. And I think even now, when you think about learning new knowledge, if you look at a lot of these academic benchmarks, it’s like, how can we learn very specific facts, like the length of a bridge in this African country? And that’s not something that you really want the models to devote capacity to, and it’s not something that we devote capacity to.
So I think if you look at human memory—I mean, you can say a lot more about this—but it’s lossy because part of the feature of intelligence is compressing what’s important and separating that from what’s not important. And so I think you can’t really separate fact learning from non-fact learning or skill learning, as some people would like to think. If you take a model—and some people have done this with models where you strip out all the facts and just have it the pure core or something like this, it’s very unnatural as a model. It doesn’t know basic things, and you kind of need that. But I think …
Sonya Huang: Why do you need that? Why can’t you look up facts and then just have …
Jessy Lin: I think if you look at how the models think, if you need to recall basic facts in order to take the next step in your thinking, you can’t get very far. Maybe that’s a high-level intuition, but it’s part of the reason why we think training is really important. In order to think more and more complex and deep thoughts about things, you kind of need to internalize something so that you can compose them into more abstract concepts.
Dan Biderman: And there have been efforts before that were hard to scale to try and disentangle the two and pre-train the models in a way that allows it to retrieve and search for things and not internalize them. It’s just the recipe we know to hill climb on collectively right now is this fact pre-training step. And I think the mystery of this approach is that traditionally in CS we would have databases as its own curriculum and we would have algorithms. And the databases is like facts about the world and capitals of whatever, store them, query them. There’s also algorithms of how do you efficiently manipulate information and get some answers in a sample-efficient way.
And I think the magic of deep learning is that these two things are now mushed together, and we need all these smart people and Anthropic interpretability to try and break them apart. And I think a lot of what we’re seeing now in the adoption of AI into the economy is that these things are gradually separating again, where companies have their own context, and they really handle them with care and engineer them with care. And there’s a generic model that’s completely a stranger to these contexts and the model is operating on them. But for us, it’s clear that there needs to be a certain convergence, at least with some cadence, where the facts and the stories and the details are getting mixed into the model.
It has disadvantages as well, because, you know, capitals of countries, they can change, but it’s not very frequent. But there’s many other facts that are changing all the time, and just imprinting them into weights is a challenging thing to do.
Sonya Huang: I see. So you’re saying it’s a false dichotomy to try to separate algorithms from databases here. What really matters is how to distinguish what’s important to remember versus what’s not important.
Jessy Lin: Exactly.
[CROSSTALK]
Sonya Huang: … how we dream. Are you guys taking any inspiration from that in terms of ranking?
Jessy Lin: Very, very loosely, I think. Just the idea that that’s kind of a phase that’s missing, maybe, where you take a context and you deeply internalize it. Right now, it’s like everything happens at test time. You look at the context that the user gives you and you do some thinking on the fly. But again, you can’t get very far, or you can get so far maybe, and you make mistakes along the way. Like, how do you digest that back into the model so that next time you do it, you do it the right way and make even more progress?
Dan Biderman: Yeah. And what are dreams? Dreams are pretty crazy things. To say we want to build an AI that’s like our dreams sounds a little bit like a nut thing to do. There’s not a lot of coherence there. But what’s interesting there is what happens in our dreams? We see things, we talk to ourselves, and we experiment with the affordances of what can we do and can’t we do in the world, in social situations, and in any—it’s heavily biased towards social stuff, right? So for us too, with the things we’re building is we give the models the time to then go back, retreat from the actual interaction and experiment with its affordances. What can it do in an environment? What does it know? How fast can it handle these kind of tail extreme things, the same ones that we dream about at night?
Shaun Maguire: You guys come from academic backgrounds. What’s a canonical example that motivates this problem or that’s a win so far?
Dan Biderman: Yeah, I have one example—maybe Jessy can give another one—a hypothetical one. For example, imagine one of the AI labs, say OpenAI, has to win some math Olympiad in a week’s time from now. Would they construct a catalog of all the math textbooks and really have people annotate which chapters to get and which graphs to see? Or will they actually collect this, synthesize some training data, launch a training job, see where it lands in five, six days, start evaluating it and stuff like that? So it’s obvious for anyone who’s trained models that there’s a superior way to integrate across the ideas and capabilities, and it involves this kind of magic of training. And we are clear that this has to happen in those high-stake domains of math and coding and cyber and stuff. We just think much of this magic can actually end up in the hands of many more people in interesting ways.
Shaun Maguire: Why isn’t it just the foundation model labs that own the end product here? How do you go between giants?
Jessy Lin: Yeah. So I think the worldview that we have is a bit different from the frontier lab worldview, where it’s like, we want one model that’s bigger and bigger, that’s more and more intelligent across a variety of domains. Instead, how we see it, like, we kind of imagine this world where everybody has their own model. A lot of the things that people want to learn are either private, things that’ll never see the light of day in a post-training dataset, or even conflicting, like, oh, the way that I want to do the task is different from how another company or another individual wants to.
And I think a lot of these things we’re already seeing are hard to train into the models with the same tools that we have used for decades in machine learning, which is you have really clean supervision, you have ground truth reward signals, and you create a nice environment and you train the model to use the tools to better accomplish this coding task. And instead, a lot of the things that actually happen out in the world are very ambiguous, or it’s hard to say what makes something good.
And so I think a lot of these things are very specific to individuals, and I think very kind of misaligned or not very aligned with how the frontier labs think about the whole training pipeline and what kind of models will exist in the longer term.
Dan Biderman: Yeah. And to add to it, I think, what is the P0 for frontier labs? And some of you here are pretty close with them. It’s getting to AGI, getting this one generic model that’s extremely capable in coding and math, and then using it to automate the economy or to solve really hard long-term problems in cryptography and defense or whatever. And it’s pretty clear what needs to happen to push this: more pre-training, bigger models, more data, more RL, more inference time compute, that kind of stuff. That’s P0. That’s where the majority of expenditure and talent goes.
And definitely all of them are thinking about memory and all of them are thinking about continual learning. It’s just more of a product kind of effort right now. We think it deserves its own attention. We think breakthroughs need to happen there. And Demis at the Sequoia event about a month ago said pretty clearly that we need new breakthroughs around these topics. And obviously they’re thinking about them. We’re just focusing exclusively on this.
And we think certain things around incentives of where the data is and who owns the model are pretty interesting. So if you could learn from many humans or organizations at scale without necessarily sending someone to work with them shoulder to shoulder, that would be a pretty big unlock.
Jessy Lin: And maybe another point on that is, I think a lot of things need to look different in the world. So one is that there needs to be new research breakthroughs. Two is new infrastructure for training small models for everybody rather than one big model, one big run. And then the third, I think, is a different way of kind of combining research and products. So right now, I think there’s researchers in these frontier labs, they train the model, they throw it over the fence to the product team who then prompts or context engineers new product surfaces on top of the core models. But in this world where the models are always training, I think the inputs that users provide are very intricately tied to what the models learn from, what the training signal is. And so there needs to be a lot more of an integrated loop between research and product. And so while we’re focused on tackling a lot of the core research challenges, and that’s our background, I think we’re also very focused on, like, how to deploy this as quickly as possible to learn from actual feedback in the real world.
Sonya Huang: What motivated you to work on this problem?
Jessy Lin: I think it’s obviously one of the grand challenges in AI. I think everybody’s talking about it these days, because the models are so smart, so what else is left? I think learning at the edges, learning the remainders of what makes these models useful. It’s not just about raw intelligence anymore, it’s about learning new things. And I think it also feels very fundamental because it kind of goes back to really understanding what makes the model so good.
So right now, the models kind of incidentally know a lot of things from pre-training, and we don’t really understand why. It’s like the internet was just this gift granted to us where there’s a diverse set of data that contains all of these different examples of coding and writing and all these other things. And it just happened that way.
And now to figure out how to crack this problem of continual learning, it’s about figuring out what about pre-training or even post-training makes it possible for the models to generalize in these magical emergent ways, and controlling that process so that a company has a set of private data. How do we make the models learn that just as well as the models know the capital of France or how to write Python? So I think it’s a really fun problem to think about.
Sonya Huang: And Dan, you came from the neuroscience world, is that right?
Dan Biderman: Yes. Yes. So I was initially interested in questions around consciousness and the human condition and things like that.
Sonya Huang: Are the models conscious? [laughs]
Dan Biderman: I don’t have any advanced thoughts on this more than you would read. I don’t think so, but it’s important that smart people are thinking about it. I would say I was interested in how humans think, how humans perceive, and as Amos Tversky, the Israeli psychologist, used to say, he’s not interested in artificial intelligence, he’s interested in natural stupidity. So I would say I started kind of similarly, trying to see how people and animals experience the world.
Gradually, my inclinations took me to the stats and AI domains, and there I figured that so many of the same problems of memory and continual learning are really urgent, and the kind of solutions we have in the current systems are pretty far from what we have in biology. And I’m not one of these people who would say that the machine should be like the animal or the human brain. I don’t think so. There’s many things computers can do better than us, but human memory has these very different things in it. You know, if you want to store a whole code base, you can use a computer. You don’t even need AI on the computer to store everything losslessly and just get it. But the human brain evolved to work in these constraints of information capacity, and to have these fuzzy representations that can then be abstracted and form connections and inform the next day. Current systems don’t really have that beyond the generic pre-training step. And I was really interested in what are ways to build that in, what are ways to learn from that.
Shaun Maguire: This is more of a philosophical question. You mentioned in the brain, there’s a bunch of different real estate, different coprocessing units, whatever. Modern computer architecture, there’s CPUs, GPUs, memory, there’s different coprocessors. With the bitter lesson, do you think that what’s happening is that LLMs are, you know, converged to say one coprocessor that’s just totally dominant. It’s like everything, all compute is going to happen in, you know, the GPU equivalent of like a language model? Or do you think that these models are kind of building a bunch of coprocessors emergently inside the model? And take with memory, like, do you think that the models themselves will just build whatever part of the brain equivalent would be that’s good at memory? Or do you think there needs to be another standalone architecture?
Sonya Huang: Yeah. Is memory an emergent property almost versus a distinct …
[CROSSTALK]
Shaun Maguire: Exactly. And almost everything. Is everything that we need in intelligence will just be emergent with better training data and more scaled compute?
Dan Biderman: Yeah. I would say just on a more superficial perspective on the current deployment of AI, it’s way more than just GPUs. And we’re seeing all these sandboxes exploding and models operating on other computers trying things.
Shaun Maguire: I more mean on the model architecture level rather than on the …
Dan Biderman: Yeah. So other experiments—there have been many previous experiments on different architectures that we contributed to, like the state-space family and others to try and handle very, very long contexts more efficiently. The thing with all these methods is it ends up being a trade-off, usually a trade-off between memory and accuracy, and memory not in the behavioral cognitive sense, memory in the computer sense, right? Instead of having the memory footprint of the transformer attention, which is quadratic in the sequence length, these models have …
Shaun Maguire: Some are claiming they have sub-quadratic.
Dan Biderman: Yeah, some are claiming, and some do have it, right? And some of the best Chinese models have layers that are inspired by those state-space architectures and are not quadratic in cost. The thing is that in our hands we find that you always compromise accuracy for this memory. There’s no free lunch. And what we’re saying is like, look, if you’re really bitter-lesson-pilled, what you want to do is you want to think, how can I burn more compute, and how can I burn it on new contexts that I have not seen before? So we’re as bitter-lesson-pilled as anyone else, and we are not betting that the overall direction of AGI is going to end anywhere soon. We just think there’s more compute to scale. And if I truly want to understand Shaun and Shaun’s work and Shaun’s context, just rereading files is not going to make it, especially for a special person like you.
Shaun Maguire: “Special” is derogatory.
Dan Biderman: We got to train 100 trillion parameters for this guy.
Sonya Huang: Co-sign, co-sign. What are you finding that people care most about their models learning? Is it memorizing facts about the organization? Is it remembering, like, ah, no, we do CI this way? What are people actually hoping to—and then maybe this feeds into how you do the ranking of memory slots and all that.
Jessy Lin: Yeah. Well, I think if you look at what people are spending their time in the app layer doing these days, it’s a lot of just trying to make the model work well for your use case. Like, oh, I want the model to, let’s say, design my website with my brand style. That’s a very common example these days, but there’s many kinds of different tasks that people do with agents, like learning how to run a workflow, or your particular way of writing, let’s say. So there’s many, many kinds of things. And honestly, I think when we think about these methods, kind of going back to this distinction between facts and skills, there really is none. I think the methods are kind of agnostic to that.
Dan Biderman: Yeah. To me, it’s like the natural thing. Almost all the app layers are basically a frontier model wrapped in a loop with search tools and stuff. And what they’re all interested in doing with us is finding ways to kind of interface with their data in a way that’s faster, more efficient, and also is more contextual. So almost all of them, it’s like, we want to have our firm knowledge be encoded in something that’s more efficient that I don’t have to re-search. We want to have the model know in a targeted way who’s the person I should triage a thing to. And we’re just showing them that with pretty lightweight training, these things can be instinctual to the models. They don’t have to have these very involved, long REPL loops to solve them. So in a sense, it’s like a RAG killer kind of thing. Again, we can always do RAG and we can always retrieve, but that’s the thing that people are interested in: interfacing with very large data planes and automating very repetitive things this way.
Sonya Huang: And I want to double-click on this RAG killer thing, and I’m sorry to beat a dead horse. I just don’t fully grok it yet. Is the premise that there’s some trade-off between doing RAG versus updating your model weights? Is the idea that you should be doing both? Like, what types of things should be done in the weights versus what types of things should be externalized to RAG?
Dan Biderman: I think it’s an unsolved problem. I don’t think anyone has an answer to it. We’re all working on it. It’s also the fundamental question of biological memory, what should be internalized versus what’s not. I do think that things that are like, you know, do you need to internalize the room number in a hotel that you were in a year ago? Probably no, not in your neural tissue. Probably that’s good to write down, but do you need to internalize maybe the password to your home right now? Probably it’s useful for the next few years to have that imprinted somewhere.
So yeah, how does this translate into knowledge work and products? This is still something we figure out, and we try to take the approach that we try to use as few heuristics as possible. It’s easy to run filters on the data and say, I’m going to keep this, discard that, train on this, train on that. But as humans, we watch TikTok and we get exposed to a lot of garbage, and still the brain is able to learn and not completely go off the rails, and we think models should be the same as well.
Jessy Lin: Yeah. Maybe concretely in the short term, I think a lot of what people are worried about these days is the huge inference costs of running these agents for days on end.
Sonya Huang: High inference cost is a good thing. [laughs]
Jessy Lin: I mean, consuming tokens for what?
Shaun Maguire: Sonya works at Fireworks. She really loves inference.
Sonya Huang: I love inference.
Dan Biderman: We love inference.
Jessy Lin: Yeah. So I think it’s like, in the short term, that’s the immediate pain point. Why are you reading the same files over and over again, even in the same query? But definitely across people in the same company, they’re running the same queries on the same documents over and over again. And that should be something the model just knows. In the same way you ask an employee, they don’t type into the search box, “What was I working on yesterday?” They just know.
Sonya Huang: But doesn’t caching kind of solve that?
Jessy Lin: I think to some extent, yeah. But I think going back to this question of what should be internalized versus what’s something you retrieve at test time, I think again, a lot of it is about building on your knowledge. So if you are always doing RAG, you can’t make associations like, oh, I see somebody on the team is doing this kind of research, and I recall at an abstract level, oh, there’s this related thing that you might want to know about. You didn’t even ask about it, right? But I think these kinds of associations can only happen in weights because they’re not really about, you know, you asked me to search for this, I’m going to search for this.
Dan Biderman: And also, I think the main limitation with retrieval systems in general and AI specifically is the problem is not so much what to store and where to put it. The problem is how to address it, how to query the thing. Do you know what to look for even?
Jessy Lin: Yeah.
Dan Biderman: And this involves some sort of intuition that sometimes the models don’t have, interestingly enough. They don’t know where to look. And especially if you’re limited to the current way of doing things, which is keyword search, that is just easier to scale in RL and least involved in terms of infra for embeddings and stuff. So yeah, knowing what to search is something that’s intuitive and can happen in the weights.
And also about caching and inference, much of this company started with us taking a deep dive into KV caches and caching. And this is a fascinating thing, right? KV cache is a monstrosity of the current way of doing things that—you know, think about it, a KV cache for a single Wikipedia article, for Taylor Swift or something like this, will be like 80 gigabytes of HBM memory on the GPU. And for, say, a 70B LLaMA model, the entire weights of the model would be about 100 gigabytes. And with some distortion, they remember the entire internet. And how come one thing is so bit efficient? And we have this proof of existence that gradient descent can pack a lot of information in very few numbers, whereas this KV cache thing, you take a few tens of kilobytes of article and it becomes those 80 gigabytes of brain state. So sure, you can cache this, you can load this, you’ll have issues with disk-to-HBM stuff. People are working on it, it’s pretty interesting. But what if we can take those 80 gigabytes, spend some compute offline, maybe also in Fireworks, but then compress it and make it really, really small so that the thing we load in cache is 1,000x smaller? That would have tremendous implications for how we load things, how fast we can do things, and what the fidelity of their representation is.
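The blow-up from raw text to KV cache that Dan describes follows from the standard size formula for a transformer KV cache, sketched below. The configuration (80 layers, 8 KV heads, head dimension 128, 16-bit values), the 10,000-token document, and the 4 bytes of text per token are illustrative assumptions, not the settings behind the figures quoted in the conversation. Actual sizes vary widely with architecture, precision, and context length.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a transformer KV cache in bytes.

    The leading factor of 2 counts one key vector and one value vector
    per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Hypothetical model configuration and document length.
num_tokens = 10_000
per_token = kv_cache_bytes(1, num_layers=80, num_kv_heads=8, head_dim=128)
cache = kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8, head_dim=128)

# Assume roughly 4 bytes of raw text per token for the comparison.
text_bytes = num_tokens * 4
blowup = cache / text_bytes
```

In this assumed setup each token costs about 320 KB of cache, so roughly 40 KB of text becomes about 3.3 GB of cache, a blow-up of several tens of thousands of times. A model without grouped-query attention, or a longer context, would push the figure higher still.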
Sonya Huang: Super interesting.
Shaun Maguire: What are some of the things that could happen in the next year or two that would be like the ChatGPT moment of memory? Or do you think that that’s not how things will play out?
Jessy Lin: It’s a good question. I don’t know, I think the first proof of concept of the thing that people keep talking about with continual learning, which is you have an intern that you can teach things over time and it actually gets better. I think everybody’s waiting to see that. And no matter how sophisticated the context engineering approaches are these days, they’re not getting there. So I think you need all of these tools at your disposal to make that happen. But I think it will be something like that where it’s like the model’s actually getting smarter. Like, whoa, it’s different from yesterday.
Dan Biderman: Yeah. And it’s important to say that the ChatGPT moment was not anticipated. We just read about all the …
[CROSSTALK]
Dan Biderman: … product directions that certain people had before ChatGPT was different. I feel like to me, the example is like, look, if you resigned from your job today and your sole mission was to make a model that’s better for you, and you would use OpenAI, Anthropic and all these frontier models, and you would just 24/7 engineer the context right skills, your way to move the needle is very limited as an individual. You’ll just be better off waiting for the next version of the model and you’ll take it from there. And we would like to see a future where actually the more time you spend on the thing actually translates to the quality of performance, at least in the things and domains you care about. And this is pretty hard to achieve. And the only reason we think it could be achieved is if you start scaling compute and training on these data without destroying the model, importantly, which is pretty hard.
Shaun Maguire: This is just for fun, rapid-fire questions going off just memory. When’s the last time you were each surprised about something in AI, in any area?
Dan Biderman: When reading about fundraising.
Sonya Huang: [laughs]
Dan Biderman: A lot of surprises every day. I would say all of us felt a little bit of a change around the capabilities of the coding agents.
Jessy Lin: That’s true.
Dan Biderman: But we’ve been dabbling with these things and trying to make them work in more effortful ways before, so it didn’t come as a complete surprise. But yeah, I think to me the main events were GitHub Copilot. That for me was just the main event. And ChatGPT. And then seeing the agentic stuff, we all anticipated, I think, and different people had different expectations on how far it can go and how long horizon it can go. But I feel yeah, we’re yet to see something fundamentally different. And people are working on completely new ways of doing things now. But yeah, to me, it’s models actually changing in a way that’s not harmful, and learning new things on the fly that are personally and economically viable. That’s interesting.
Sonya Huang: Right now, there’s this idea of, like, we’re each going to have a token wallet that we’re going to bring around to companies, or to different apps, different workspaces. Do you think that we’re going to end up with a memory bank, a memory wallet that we’re going to move across the digital world as we go?
Jessy Lin: I think it’s an interesting question. I don’t know if we’ve fully figured out what the right kind of product form factor is in this sense. In a way, even with ChatGPT memory, let’s say, I kind of don’t want it to remember across my personal and work contexts.
Sonya Huang: Oh, yeah.
Jessy Lin: It’s like, “Oh, you might like these sheets because you trained a model on a GPU last week.” It’s like, that’s totally irrelevant. And to some extent it’s because the memory is flawed, but also I think you do want memory in your tools and the products that you use to be separated, to have control over that. So I personally think there needs to be some separation there, but I guess it’s to be determined what that might look like.
Dan Biderman: Yeah. And I think a holy grail is you go to work, and you just burn through all these tokens and you create all this value. And somehow, you know, all the IP and stuff stays with the company, but somehow the skills you learned, the things you invented, your ways of doing things, some of them you can take with you as well to your next job in a way that’s sanitized and not harmful to any other company’s IP.
So I do think, like, carrying a set of skills will be interesting. We do it in our biology right now and we just sign NDAs and have ethical rules around it. But I think doing it in a digital world would be pretty interesting, and pretty rewarding, because it will force each of us to push the frontier and implement AI more deeply in our companies, in our individual life, and then be rewarded for it.
Shaun Maguire: I started a PhD in [inaudible] in 2007 at Stanford, and AI was boring as hell at the time. It was all statistical learning. And there’s basically two areas: computer vision and NLP. So vision and language were kind of the two areas, and I think that’s still true. In 2012, AlexNet happened, like, vision was dominating for six years or whatever. Are you guys surprised that language seems to be—like, the language approach seems to be dominating over vision in progress? Question two, do you think vision has any chance of coming back? How do you think about this?
Jessy Lin: Yeah, I think it is pretty surprising to me. I mean, some people maybe saw it coming, but I think I’ve always kind of been interested in language as, I don’t know, I guess a medium for communication. And so many kind of complex abstract things can be done in language. I do imagine in the longer term, language and vision will kind of combine in this more unified system where we kind of take in inputs from all of these different modalities and understand them in this abstract way.
Dan Biderman: Yeah, to me, I’ve never been interested in language. It seemed to me such an advanced capability. The entire animal kingdom has very different forms of speech and language than how we communicate with ourselves and in writing. And I always, as many other leaders in AI, had this thought that the natural thing is you have to experience the world, act in it, and vision in action, that will be the key. But then I’ve, like anyone else, seen the ChatGPT moment, and went to do some work at Mosaic and stuff like that to learn how the sausage is made on the NLP side. And the thing that’s striking is that language should be pretty hard. Each word has this one-hot embedding vector that’s equally dissimilar to every other word. You know, it’s a completely high-dimensional space and it’s really artificial in a sense. And we learn it with models that are orders of magnitude bigger than the best vision models. And still, things work pretty well. I do think there’s a lot of juice to be squeezed in image and video, and I think you guys are doing good investments in this space. But I think the two would keep being interesting in different ways.
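To make the point about one-hot word vectors concrete, here is a small sketch showing that a one-hot encoding carries no notion of similarity: every pair of distinct words is orthogonal and equally far apart. The three-word vocabulary and the helper names are hypothetical, chosen only for illustration.

```python
import math


def one_hot(index, vocab_size):
    """Vector of zeros with a single 1.0 at the word's index."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vocabulary: two related words and one unrelated word.
vocab = ["cat", "kitten", "carburetor"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

for other in ("kitten", "carburetor"):
    similarity = dot(vectors["cat"], vectors[other])
    distance = euclidean(vectors["cat"], vectors[other])
    print(f"cat vs {other}: dot={similarity}, distance={distance:.4f}")
```

In this encoding "cat" is exactly as far from "kitten" as it is from "carburetor", so any sense that two words are related has to be learned by the model rather than read off the input representation.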
Shaun Maguire: I mean, now I’ll tell you my—that was my lead-up. Now I’m gonna tell you the crackpot theory.
Sonya Huang: [laughs]
Shaun Maguire: And this podcast is not for me to pontificate, it’s for you guys, but this is something I’ve been thinking a lot about and you’re the right people to share this with. I was pretty shocked that language kind of surpassed vision, and I underestimated what was happening with LLMs in 2018, 2019, 2020, because I just had this bias towards vision. And when I look back on it now, like, I think what’s basically happening is that in biology, vision has a massive fundamental advantage over language. And maybe I’m wrong, but basically the bitrate that your brain can process optical data through the eye is—and I’m not a biologist, this is just kind of my dumb assessment—seems many orders of magnitude greater. And there’s a lot of optical processing that happens even before you reach electrons. And so it’s just like the total bitrate of training data that’s kind of being processed and then making it to your brain seems many orders of magnitude greater than the audio data where, you know, it’s sound waves, where sound waves are fundamentally much slower bitrate than light.
And then there’s almost like an upscaling from the acoustics to electronics, which make it into your brain, whereas there’s like a downscaling from photons to electrons with vision, whereas in computers today, everything is electronic. So it’s kind of like you nerfed vision and you promoted language where all processing is on the same playing field. It’s all electronic. And I think this is my crazy-ass, dumb, non-technical crackpot theory, but I think this might be part of why, just from an information theory perspective, that maybe language and vision are on a similar playing field by the time you get to LLMs. And then LLMs were just a really smart architecture that’s better suited for language than for vision. How dumb does this sound, especially to you, Dan, the neuroscientist?
Dan Biderman: Jessy also has some background in cognitive computational science, right?
Jessy Lin: Yeah.
Dan Biderman: So I would say my point here is like, look, much of what we’re doing in knowledge work, we haven’t evolved to do, right? We’re sitting on these computers reading these things, writing these memos, whatever. We are not evolved to do this. It’s new to us, our brains are not wired for this. Still, nevertheless, it’s useful to have LLMs to do this for us. And as humans, we’re heavily vision biased. You know, rodents are more olfactory biased, and I’ve worked on these things myself before. So what’s the real estate in the brain that’s allocated to vision and, you know, occipital lobes versus, like, language areas, temporal lobe? Probably more vision. I’ll have to check with ChatGPT, but I think that’s the situation.
Shaun Maguire: You don’t know from memory?
Dan Biderman: No, man. I’m externalizing. I’m a big RAG believer in my personal lifestyle.
Shaun Maguire: In the limit, we’re all—it’s all RAG.
Dan Biderman: I internalize just important things like my emotions to you. No, just kidding. Anyways, yeah, and vision is dominating. When people are training vision-language models, language ends up dominating the vision content there. But yeah, it’s hard to say. Just because a certain brain is more biased towards a certain modality doesn’t necessarily mean that we’re going to do it more efficiently. I do think that efforts on brain-computer interfaces should take this into account. How do you then relay it back to the brain? That’s where I think it’s really important to think, like, what real estate do we have there right now? But for knowledge work, it’s equally fine if it’s text, I think.
Sonya Huang: Last question. If everything goes right, what does the world look like in five, ten years? And then what is Engram’s role in it?
Jessy Lin: I think I’m imagining a world where everyone has their own model that is really different from the other person’s model and from the frontier model. And all of these kind of serve different purposes. And to have a model that really—you know, I think people often talk about knowing you, but also helping you in the ways that make sense to you personally, whether it’s an individual or a team. I think there’s an element of having different kinds of intelligence everywhere.
Dan Biderman: Yeah. And to me, actually, it’s a variant of the story where in neuroscience, we know that memory and navigation are pretty closely related: the same circuits in the brain that represent landmarks in space are in charge of some elements of episodic memory and things like this. And for me, I think the company can be the actual LLM interface to the data plane for everyone. So sharing some similarities to great companies like Databricks and Oracle, where we form these memories that happen to be neural memories, with models that happen to be personalized, and there happen to be hundreds of millions of them, but they’re basically a neural interface in the data plane in a way that’s very different from what we know. And it’s more efficient, it’s more associative, it’s not representing the file system as it is, it’s representing a brain state of that file system. So that’s for me a vision.
Sonya Huang: Beautiful vision to end on. Thank you guys so much for coming by to share what you’re building.
Dan Biderman: Awesome. Love it.
Sonya Huang: Thank you.
Shaun Maguire: Thank you, guys.
