Hi, I'm Craig Smith, and this is Eye on AI. This week I talked to Yann LeCun, one of the seminal figures in deep learning development and a longtime proponent of self-supervised learning. Yann spoke about what's missing in large language models and about his new joint embedding predictive architecture, which may be a step toward filling that gap. He also talked about his theory of consciousness and the potential for AI systems to someday exhibit the features of consciousness. It's a fascinating conversation that I hope you'll enjoy. Okay, so Yann, it's great to see you again. I wanted to talk to you about where you've gone with self-supervised learning since we last spoke. In particular, I'm interested in how it relates to large language models, because the large language models really came on stream since we spoke. In fact, in your talk about JEPA, which is the joint embedding predictive architecture. There you go. Thank you. You mentioned that large language models lack a world model. I wanted to talk first about where you've gone with self-supervised learning and where this latest paper stands in your trajectory. But to start, if you could just introduce yourself and we'll go from there. Okay, so my name is Yann LeCun, or Yann Le Cun if you want to do it the French way, and I'm a professor at New York University, at the Courant Institute and the Center for Data Science. And I'm also the chief AI scientist at FAIR, which is the fundamental AI research lab, that's what FAIR stands for, at Meta, née Facebook. So tell me about where you've gone with self-supervised learning, how the joint embedding predictive architecture fits into your research, and then if you could talk about how that relates to what's lacking in large language models. Okay, self-supervised learning has basically brought about a revolution in natural language processing because of its use for pre-training transformer architectures. And the fact that we use transformer architectures for that is somewhat orthogonal to the fact that we use self-supervised learning. But the way those systems are trained is that you take a piece of text, you remove some of the words, you replace them by blank markers, and then you train a very large neural net to predict the words that are missing. That's a pre-training phase. And then in the process of training itself to do so, the system learns good representations of text that you can then use as input to a subsequent downstream task, I don't know, translation or hate speech detection or something like that. So that's been a complete revolution over the last three or four years, including in very practical applications: every sort of content moderation system on Facebook, Google, YouTube, et cetera, uses this kind of technique. And there are all kinds of other applications. Now, large language models are partially this, but also the idea that you can train those things to just predict the next word in a text. And if you do that, you can have those systems generate text spontaneously. So there are a few issues with this. First of all, those things are what's called generative models, in the sense that they predict the words, the information that is missing, words in this case. And the problem with generative models is that it's very difficult to represent uncertain predictions. So in the case of words, it's easy, because we just have the system produce essentially what amounts to a score or a probability for every word in the dictionary.
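(A minimal illustration of the masked-word objective described above, as a hypothetical toy sketch rather than any actual production training code: a word is replaced by a mask token and a small network is trained to output a score, and hence a probability, for every word in a tiny vocabulary. The model, names, and sizes are all made up for illustration.)

```python
# Toy sketch of masked-word prediction: mask one word and train a small network
# to output a score (and probability) for every word in the vocabulary.
import torch
import torch.nn as nn

vocab = ["[MASK]", "the", "cat", "dog", "chases", "mouse", "in", "kitchen"]
word_to_id = {w: i for i, w in enumerate(vocab)}

class TinyMaskedLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # stand-in for a transformer
        self.to_vocab = nn.Linear(dim, vocab_size)          # one score per dictionary word

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))
        return self.to_vocab(h)                             # (batch, seq, vocab) logits

sentence = ["the", "[MASK]", "chases", "the", "mouse", "in", "the", "kitchen"]
ids = torch.tensor([[word_to_id[w] for w in sentence]])
target = torch.tensor([word_to_id["cat"]])                  # the word that was removed

model = TinyMaskedLM(len(vocab))
logits = model(ids)[:, 1, :]                                # scores at the masked position
loss = nn.functional.cross_entropy(logits, target)          # the pre-training objective
probs = logits.softmax(dim=-1)                              # a probability for every word
print({w: round(float(probs[0, i]), 3) for i, w in enumerate(vocab)})
```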
So it can tell you, if the word missing in a sentence is "the blank chases the mouse in the kitchen," that it's probably a cat, could be a dog, but it's probably a cat, right? So you have some distribution of probability over all words in the dictionary, and you can handle uncertainty in the prediction this way. But then what if you want to apply this to, let's say, video, right? So you show a video to the system, you remove some of the frames in that video, and you train it to predict the frames that are missing, for example to predict what comes next in a video, and that doesn't work. And it doesn't work because it's very difficult to train the system to predict an image, or a whole image. We have techniques for generating images, but actually predicting good images that could fit in the video doesn't work very well. Or if it works, it doesn't produce internal representations that are particularly good for a downstream task like object recognition or something of that type. So attempting to transfer those SSL methods that are successful in NLP into the realm of images has not been a big success. It's been somewhat of a success in audio. But really the only thing that works in the domain of images is those joint embedding architectures where, instead of predicting the image, you predict a representation of the image, right? So you feed, let's say, one view of a scene to the system, you run it through something that computes a representation of it, and then you take a different view of the same scene, you run it through the same network that produces another representation, and you train the system in such a way that those two representations are as close to each other as possible. And the only thing the systems can agree on is the content of the image, so they end up encoding the content of the image independently of the viewpoint. The difficulty of making this work is to make sure that when you show two different images, it will produce different representations, so to make sure that the representations are informative about the inputs and the system doesn't collapse and just produce the same representation for everything. So that's the reason why the generative architectures that have been successful in NLP aren't working so well on images: their inability to represent complicated uncertainties, if you want. So now, that's for training a system with SSL to learn representations of data. But what I've been proposing to do, in the position paper I published a few months ago, is the idea that we should use SSL to get machines to learn predictive world models. So basically to predict how the world is going to evolve; to predict the continuation of a video, for example; possibly to predict how it's going to evolve as a consequence of an action that an intelligent agent might take. Because if we have such a world model in an agent, the agent, being capable of predicting what's going to happen as a consequence of its actions, will be able to plan complex sequences of actions to arrive at a particular goal. And that's what's missing from pretty much all the AI systems that everybody has been working on, or has been talking about loudly, except for a few people who are working on robotics, where it's absolutely necessary. So some of the interesting work there comes out of the robotics community, the sort of machine learning and robotics community, because there you need to have this kind of ability for planning.
And the work that you've been doing, is it possible to build that into a large language model, or is it incompatible with the architecture of large language models? It is compatible with large language models. And in fact, it might solve some of the problems that we're observing with large language models. One problem with large language models is that when you use them to generate text, you initialize them with a prompt. So you type in an initial segment of text, which could be in the form of a question or something, and then you hope that it will generate a consistent answer to that text. And the problem with that is that those systems generate text that sounds fine grammatically, but semantically they sometimes make various stupid mistakes. And those mistakes are due to two things. The first thing is that to generate that text, they don't really have any sort of objective other than just satisfying the statistical consistency with the prompt that was typed. So there is no way to control the type of answer they will produce, at least no direct way, if you want. That's the first problem. And then the second problem, which is much more acute, is the fact that those large language models have no idea of the underlying reality that language describes. And so there is a limit to how smart they can be and how accurate they can be, because they have no experience of the real world, which is really the underlying reality of language. So their understanding of reality is extremely superficial and only contained in whatever is contained in the language they've been trained on. And that's very shallow. Most of human knowledge is completely non-linguistic. It's very difficult for us to realize that's the case, but most of what we learn has nothing to do with language. Language is built on top of a massive amount of background knowledge that we all have in common, that we call common sense. And those machines don't have that, but a cat has it, a dog has it. So we're able to reproduce some of the linguistic abilities of humans without having all the basics that a cat or a dog has about how the world works. And that's why those systems fail in the ways they actually do. So I think what we would need is an ability for machines to learn how the world works by observation, in the manner of babies and infants and young animals; to accumulate all the background knowledge about the world that constitutes the basis of common sense, if you want; and then to use this world model as the tool for being able to plan sequences of actions to arrive at a goal. So setting goals is also an ability that humans and many animals have: the setting of sub-goals for arriving at an overall goal, and then planning sequences of actions to satisfy those goals. And those models don't have any of that. They don't have an understanding of the underlying world. They don't have any capability for planning. They don't have goals; they can't set themselves goals other than through typing a prompt, which is a very weak way. Where are you in your experimentation with this JEPA architecture? So, pretty early. We have simplified forms of it that we call joint embedding architectures, without the P, without the predictive part. And they work quite well for learning representations of images. So you take an image, you distort it a little bit, and you train the network to produce essentially identical representations for those two distorted versions of the same image.
And then you have some mechanism for making sure that it produces different representations for different images. And that works really well. And we have simple forms of JEPA, the predictive version, where the representation of one image is predicted from the representation of the other one. One version of this was actually presented at NeurIPS; it's called VICRegL, L for local. It works very well for training a neural net to learn representations that are good for image segmentation, for example. But we're still working on a recipe, if you want, for a system that would be able to learn the properties of the world by watching videos, understanding, for example, very basic concepts like the fact that the world is three-dimensional. The system could discover that the world is three-dimensional by being shown video with a moving camera. And the best way to explain how the view of the world changes as the camera moves is that every pixel has a depth; that explains parallax motion, et cetera. Once that concept is learned, then the notion of objects and occlusion, objects that are in front of others, naturally emerges, because objects are parts of the image that move together under parallax motion, at least for inanimate objects. Animate objects are objects that move by themselves, so that could also be a natural distinction. This ability to spontaneously form categories, babies do this at the age of a few months. Without having names for anything, they can tell a car from a bicycle, a chair, a table, a tree, et cetera. And then on top of this you can build notions of intuitive physics, the fact that objects that are not supported will fall, for example; babies learn this at around nine months, roughly, which is pretty late, and inertia, things of that type. And then after you've acquired that basic background knowledge about how the world works, you have a pretty good ability to predict, and you can also predict, perhaps, the consequences of your actions when you start acting in the world. And then that gives you the ability to plan; perhaps it gives you some basis for common sense. So that's the progression we need to go through. We don't know how to do any of this yet. We don't have a good recipe for training a system to predict what's going to happen in a video, for example, to any degree of usefulness. Just for the training portion, how much data would you need? It seems to me you would need a tremendous amount of data. We'd need a couple of hours of Instagram or YouTube; that would be enough, really. The amount of raw video data that's available is incredibly large. If you think about, let's say, a five-year-old child, and let's imagine that this five-year-old child can usefully analyze visual percepts maybe ten times a second, okay, so that's ten frames per second. And if you count how many seconds there are in five years, it's something like 80 million. So the child has seen on the order of 800 million frames, right, or something like that; it's an approximation, call it a billion. It's not that much data. We could have that tomorrow by just recording, like, saving YouTube videos or something. So I don't think it's an issue of data; I think it's more an issue of architecture, training paradigm, principles, mathematics on which to base this. One thing I'm convinced of: if you want to solve that problem, you have to abandon five major pillars of machine learning, one of which is those generative models, and replace them with those joint embedding architectures.
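(To make the joint embedding idea concrete, here is a minimal, hypothetical sketch in the spirit of what is described above: two distorted views of the same image go through a shared encoder, an invariance term pulls their representations together, and a variance-style term, loosely in the spirit of VICReg, discourages collapse to a constant representation. The tiny encoder and augmentations are stand-ins, not the actual FAIR code.)

```python
# Minimal sketch of a (non-predictive) joint embedding objective: make two views
# of the same image agree, while an anti-collapse term keeps representations informative.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                       # stand-in for a real vision backbone
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
)

def distort(x):
    # Crude stand-in for data augmentation: add noise and flip horizontally.
    return torch.flip(x + 0.1 * torch.randn_like(x), dims=[-1])

def joint_embedding_loss(images):
    z1 = encoder(distort(images))              # representation of view 1
    z2 = encoder(distort(images))              # representation of view 2 (same encoder)
    invariance = F.mse_loss(z1, z2)            # make the two representations agree
    std = z1.std(dim=0)                        # per-dimension spread across the batch
    anti_collapse = F.relu(1.0 - std).mean()   # discourage a constant representation
    return invariance + anti_collapse

images = torch.randn(8, 3, 32, 32)             # a fake batch of images
loss = joint_embedding_loss(images)
loss.backward()                                # gradients flow into the shared encoder
print(float(loss))
```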
A lot of people in vision are already convinced of that. Then, to abandon the idea of doing probabilistic modeling: we're not going to be able to usefully represent the probability of the continuation of a video conditioned on what we have already observed. We have to be less ambitious about our mathematical framework, if you want. So I've been advocating for many years to use something called energy-based models, which is a weaker form of modeling uncertainty, if you want. Then there is another concept that has been popular for training joint embedding architectures over the last few years, which I had the first paper on in the early 90s, actually, on something called Siamese networks. It's called contrastive learning, and I'm actually advocating against that too. So I'm used to this idea that once in a while you have to come up with new ideas, and it's going to be very difficult to convince people who are very attached to those ideas to abandon them, but I think it's time for that to happen. Once you've trained one of these networks and you've established a world model, how do you transfer that to the equivalent of a large language model? One of the things that's fascinating about the development of LLMs in the last couple of years is that they're now multimodal; they're not purely text and language. So how do you combine these two ideas, or can you, or do you need to? Yeah, so there are two or three different questions in that one question. One of them is: can we usefully transform existing language models, whose purpose is only to produce text, in such a way that they can do planning and have objectives and things like that? The answer is yes, that's probably fairly simple to do. Can we train a language model purely on language and expect it to understand the underlying reality? And the answer is no. And in fact, I have a paper on this in, of all places, a philosophy magazine called Noema, which I co-wrote with a philosopher who is a postdoc at NYU, where we say that there is a limit to what we can do with this, because most of human knowledge is non-linguistic, and if we only train systems on language, they will have a very superficial understanding of what they're talking about. So if you want systems that are robust and work, we need them to be grounded in reality. And it's an old debate, whether they are actually being grounded or not. And so the approach that some people have taken at the moment is to basically turn everything, including images and audio, into text or something similar to text. So you take an image, you cut it into little squares, you turn those squares into vectors; that's called tokenization. And now the image is just a sequence of tokens, the way the text is a sequence of words, right? You do this with everything, and you get those multimodal systems, and they do something. Okay, now it's not clear that's the right approach long term, but they do something. I think the ingredient that is missing there is the fact that, if we're dealing with sort of continuous data like video, we should use the joint embedding architectures, not the generative architectures that large language models currently use. First of all, I don't think we should tokenize them, because a lot gets lost in translation when we tokenize images and videos. There's also a problem, which is that those systems don't scale very well with the number of tokens you feed them.
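(A small, hypothetical sketch of the patch tokenization just described: cut an image into little squares, flatten each square into a vector, and the image becomes a sequence of tokens. The sizes are made up, but they illustrate why video blows up the token count compared with a few thousand words of text.)

```python
# Sketch of ViT-style patch tokenization: an image becomes a sequence of patch vectors.
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened patch vectors (the 'tokens')."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    patches = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * C)
    return patches

image = np.random.rand(224, 224, 3)          # one modest-resolution frame
tokens = patchify(image)
print(tokens.shape)                          # (196, 768): roughly 200 tokens per frame

# A few seconds of video at this rate quickly dwarfs a 4,000-token text context.
frames_per_second, seconds = 10, 30
print(tokens.shape[0] * frames_per_second * seconds)   # about 58,800 tokens for 30 seconds
```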
It's workable when you have text and the context you need to predict the next word is maybe the last 4,000 words; that's fine. But 4,000 tokens for an image or a video is tiny; you need way more than that, and those systems scale horribly with the number of tokens you feed them. So we're going to need a lot of new innovation in architectures there. And my guess is that we can't do it with generative models; we'll have to do it with joint embedding. How does a computer recognize an image without tokenization? So, convolutional nets, for example, don't tokenize. They take an image as pixels, they extract local features, they detect local motifs on different windows of the image that overlap, and then those motifs get combined into other, slightly less local motifs. It's a kind of hierarchy, where representations of larger and larger parts of the image are constructed as we go up in the layers. But there's no point where you cut the image into squares and turn them into individual vectors; it's more sort of progressive. So there's been a bit of a back-and-forth competition between the transformer architectures, which tend to rely on this tokenization, and convolutional nets, which don't, or do it in different ways. And my guess is that ultimately the best solution will be a combination of the two, where the first few layers are more like convolutional nets: they exploit the structure of images and video. And then by the time you get up several layers, the representation is more object-based, and there you have an advantage in using those transformers. But currently, basically, the image transformers only have one layer of convolutions at the bottom, and I think it's a bit of a waste, and it doesn't scale very well when you want to apply it to video. On the timeline, this is all moving very fast. It's moving very fast. How long do you think before you'll be able to scale this new architecture? It's not just scale; it's actually coming up with a good recipe that works, that would allow us to just plug a large neural net, or a smaller one, into YouTube and have it learn how the world works by watching video. We don't have that recipe. We probably don't have the architecture, other than some vague idea, which I call hierarchical JEPA. But there are a lot of details to figure out that we haven't figured out, and probably failure modes that we haven't yet encountered that we'll need to find solutions for. And so I can't give you a recipe, and I can't tell you if we'll come up with the recipe in the next six months, a year, two years, five years, ten years. It could be quick, or it could be much more difficult than we think. But I think we're on the right path in searching for a solution in that direction. So once we come up with a good recipe, it will open the door to a new breed of AI systems, essentially, that can plan and reason, and that will be much more capable of having some level of common sense, perhaps, and have forms of intelligence that are more similar to what we observe in animals and humans. Your work is inspired by the cognitive processes of the brain. Yeah. And that process of perception and then forming a world model, is that confirmed in neuroscience? It's a hypothesis that is based on some evidence from both neuroscience and cognitive science.
So what I showed is a proposal for what's called a cognitive architecture, which is a sort of modular architecture that would be capable of things like planning and reasoning, capabilities that we observe in animals and humans and that most current AI systems, except for a few robotics systems, don't have. So I think that's important in that respect. But it's more of an inspiration, really, than a sort of direct copy. I'm interested in understanding the principles behind intelligence, but I would be perfectly happy to come up with something that uses backpropagation at a low level but at a higher level does something different from supervised learning, which is why I work on self-supervised learning. And so I'm not necessarily convinced that the path toward satisfying the goal we're talking about, of learning world models, et cetera, necessarily goes through finding biologically plausible learning procedures. What did you think of the forward-forward algorithm, and were you involved in that research? I was not involved, although I've thought about things that are somewhat similar for many decades, very little of which is actually published. It's in the direct line of a series of works that Jeff has been very passionate about for years: new learning procedures of different types, basically local learning rules that can train fairly complex neural nets to learn good representations, and things like that. So he started with the Boltzmann machine, which was a really interesting concept that turned out to be somewhat impractical, but a very interesting concept that a lot of people studied. Then backprop, which of course he and I both had a hand in developing. Something I worked on also, simultaneously with backprop in the 1980s, is called target prop, which is an attempt at making backprop local by computing a virtual target for every neuron in a large neural net that can be locally optimized. Unfortunately, the way to compute this target is non-local. I haven't worked on this particular type of procedure for a long time, but Yoshua Bengio has published a few papers on this over the last 10 years or so. Yoshua, Jeff, and I, when we started the deep learning conspiracy in the early 2000s to renew the interest of the community in deep learning, focused largely on forms of local self-supervised learning methods. In Jeff's case that was focused on restricted Boltzmann machines. Yoshua settled on something called denoising autoencoders, which is the basis for a lot of the large language model type training that we're using today. I was focusing more on what's called sparse autoencoders. So these are different ways of training a layer, if you want, in a neural net to learn something useful without it being focused on any particular task, so you don't need labeled data. And a lot of that work was put aside a little bit by the incredible success of just pure supervised learning with very deep models: we found ways to train very large neural nets with very many layers with just backprop, and so we put those techniques on the side. And Jeff basically is coming back to them. I'm coming back to them in a slightly different form with the JEPA architecture. And he also had ideas in the past, something called recirculation, and a lot of infomax methods, which actually the JEPA uses; the ideas are similar. He's a very productive source of ideas that sometimes seem to come out of left field.
The community pays attention but doesn't quite get it right away, and then it takes a few years for those things to disseminate, and sometimes they don't. Just a minute. Hello? Beauregard, I'm recording right now. Who? Rasmus? I'll answer when I get back. Yeah, you'll be famous someday. Okay, okay, great. Thanks very much. Yeah, bye bye. Sorry about that. There was a very interesting talk by David Chalmers. At some level it was not a very serious talk, because everyone knows, as you described earlier, that large language models are not reasoning; they don't have common sense. He doesn't claim that they do. No, that's right. What you're describing with this JEPA architecture, if you could develop a large language model that is based on a world model... It would not be a large language model; it would be a world model. At first, it would not be based on language; it would be based on visual perception, maybe audio perception. If you have a machine that can do what a cat does, you don't need language. Language can be put on top of this. Language is easy, which is why we have those large language models and we don't have systems that learn how the world works. Yeah, but let's say that you build this world model and you put language on top of it so that you can interrogate it, communicate with it. Does that take you a step toward what Chalmers was talking about? I don't want to get into the theory of consciousness, but at least an AI model that would exhibit a lot of the features of consciousness. David actually has two different definitions for sentience and consciousness. You can have sentience without consciousness. Simple animals are sentient, in the sense that they have experience, emotions, and drives and things like that, but they may not have the type of consciousness that we think we have, at least the illusion of consciousness. So, sentience, I think, can be achieved by the type of architecture I propose, if we can make it work, which is a big if. And the reason I think that is that what those systems would be able to do is have objectives that they need to satisfy; think of them as drives. And having the system compute those drives, which would be basically predictions of the outcome of a situation or of a sequence of actions that the agent might take: basically, those would be indistinguishable from emotions. So, if you are in a situation where you can take a sequence of actions to arrive at a result, and the outcome you're predicting is terrible, it results in your destruction, okay, that creates fear. You try to figure out: is there another sequence of actions I can take that would not result in the same outcome? And if you make those predictions but there's a huge uncertainty in the prediction, one branch of which, with probability one half maybe, is that you get destroyed, that creates even more fear. And then, on the contrary, if the outcome is going to be good, then it's more like elation. So, those are long-term predictions of outcomes, which systems that use the architecture I'm proposing, I think, will have. So they will have some level of experience, and they will have emotions that drive their behavior, because they will be able to anticipate outcomes and act on them. Now, consciousness is a different story. So, my full theory of consciousness, which I've talked to David about, thinking he was going to tell me I'm crazy,
but he said, no, actually, that overlaps with some pretty common theories of consciousness among philosophers, is the idea that we have essentially a single world model in our head, somewhere in our prefrontal cortex, and that world model is configurable to the situation we're facing at the moment. And so we're configuring our brain, including our world model, for solving the problem of satisfying the objective that we currently set for ourselves. And because we only have a single world model engine, we can only solve one such task at any one time. This is a characteristic of humans and many animals, which is that when we focus on a task, we can't do anything else. We can do subconscious tasks simultaneously, but we can only do one conscious, deliberate task at any one time. And it's because we have a single world model engine. Now, why would evolution build us in a way that we have a single world model engine? There are two reasons for this. One reason is that a single world model engine can be configured for the situation at hand, but only the part that changes from one situation to another, and so it can share knowledge between different situations. The physics of the world doesn't change whether you are building a table or trying to jump over a river or something, and so your basic knowledge about how the world works doesn't need to be reconfigured; it's only the part that depends on the situation at hand. So that's one reason. And the second reason is that if we had multiple models of the world, they would have to be individually less powerful, because you have to fit them all within your brain, and that's of limited size. So I think that's probably the reason why we only have one. And so, if you have only one world model that needs to be configured for the situation at hand, you need some sort of meta-module that configures it: it figures out, what situation am I in, what sub-goals should I set myself, and how should I configure the rest of my brain to solve that problem? And that module would have to be able to observe the state and capabilities of the rest of the agent; it would have to have a model of the rest of the agent, of itself. And that, perhaps, is what gives us the illusion of consciousness. So I must say this is very speculative, okay, I'm not saying this is exactly what happens, but it fits with a few things that we know about consciousness. You were saying that this architecture is inspired by cognitive science or neuroscience. How much do you think your work, Jeff's work, other people's work at the leading edge of deep learning or machine learning research is informing neuroscience? Or is it more the other way around? Certainly in the beginning it was the other way around. But at this point, it seems that there's a lot of information that is reflecting back to those fields, so it's been a bit of a feedback loop. So new concepts in machine learning have driven people in neuroscience and cognitive science to use computational models, if you want, for whatever they're studying. And many of my colleagues, my favorite colleagues, work on this; the whole field of computational neuroscience basically is around this. And what we're seeing today is a big influence, or rather a wide use, of deep learning models such as convolutional nets and transformers as explanatory models of what goes on in the visual cortex, for example.
So there are people who, for a number of years now, have done fMRI experiments where they show the same image to a subject in the fMRI machine and to a convolutional net, and then try to explain the variance observed in the activity of various areas of the brain with the activity observed in the corresponding neural net. And what comes out of those studies is that the notion of multilayer hierarchy that we have in convolutional nets matches the type of hierarchy we observe in the ventral pathway of the visual system. So V1 corresponds to the first few layers of the convolutional net, V2 to some of the following layers, V4 to more, and then the inferotemporal cortex to the top layers; they are the best explanations of each other if you try to do the matching, right. One of my colleagues at FAIR Paris, who has a dual affiliation with NeuroSpin, an academic lab in Paris, has done the same type of experiment using transformer architectures and language models, essentially, observing the activity of people who are listening to stories and attempting to understand them so that they can answer questions about the story or give a summary of it. And there the matching is not that great, in the sense that there is some sort of correspondence between the type of activity you observe in those large transformers and the type of activity in the brain, but the hierarchy is not nearly as clear. And what is clear is that the brain is capable of making much longer-term predictions than those language models are capable of today. So that begs the question of what we are missing in terms of architecture, and to some extent it jibes with the idea that the models we should have should build hierarchical representations of the percept at different levels of abstraction, so that the highest levels of abstraction are able to make long-term predictions that are admittedly less accurate than the lower levels, but longer term. We don't seem to have that in current models. I had a question I wanted to ask you since our last conversation. You have a lot of things going on: you teach, you have your role at Facebook, your role, I think, at CVPR. How do you work on this? Do you have, like, three days a week or two hours a day where you're just focused? And are you tinkering with code or with diagrams, or is it in iterations with some of your graduate students, or is it something that's kind of always in your mind, and you're in the shower and you think, yeah, that might work? I'm just curious how you manage all of it. Okay, so first of all, one thing to understand is that my position at Meta, at FAIR, is not a position of management. I don't manage anything. I'm chief scientist, which means I try to inspire others to work on things that I think are promising, and I advise several projects that I'm not personally involved in. I work on strategy and orientations and things like this, but I don't do day-to-day management. I'm very thankful that someone else is doing this for FAIR, and doing a very, very good job; I'm not very good at it anyway, so it's better for everyone if I don't do it. So that allows me to spend quite a bit of time on research itself. And I don't have a group of engineers and scientists working with me; I have a group of more junior people working with me, students and postdocs, both at FAIR and at NYU, both in New York and in Paris. And working with students and postdocs is wonderful, because they are fearless and creative.
Many of them have amazing talents, theoretical abilities or implementation abilities or a knack for making things work. And so what happens very often is that either one of them will come up with an idea whose results surprise me and show that the way I was thinking about it was wrong, and that's the best thing that can happen. Or sometimes I come up with an idea and it turns out to work, which is great, usually not in the form that I formulated it; normally there are a lot of contributions that have to be brought to an idea to make it work. And then what has happened also quite a bit in the last few years is that I come up with an idea that I'm sure is going to work, and the students and postdocs try to make it work, and they come back to me and say, sorry, it doesn't work, and here is the failure mode. Oh yeah, we should have thought about this. It's okay, so here's a new idea to get around this problem. So, for example, several years ago I was advocating for the use of generative models with latent variables to handle the uncertainty, and I completely changed my mind about this. Now I'm advocating for those joint embedding architectures that do not actually predict pixels. And I more or less invented those contrastive methods that a lot of people are talking about and using at this point; now I'm advocating against them, in favor of methods such as VICReg or Barlow Twins that basically, instead of using contrastive methods, try to maximize the information content of representations. And that idea of information maximization has been around for decades, because Jeff was working on this in the 1980s, when I was a postdoc with him, and he abandoned the idea pretty much. He had a couple of papers with one of his students, Sue Becker, in the early 90s that showed it could work, but only in small dimensions, and he pretty much abandoned it. And the reason he abandoned it is because of a major flaw with those methods, due to the fact that we don't have any good measures of information content, or the measures we have are upper bounds, not lower bounds, so we can't be sure we're actually maximizing information content. And so I never thought those methods could ever work, because of my experience with that. And then one of my postdocs, Stéphane Deny, actually kind of revived the idea and showed that it worked; that was the Barlow Twins paper. So I changed my mind. And now that we had a new tool, information maximization applied to joint embedding architectures, we came up with an improvement of it called VICReg. And now we're working on that. But there are other ideas we're working on to solve the same problem with other groups of people at the moment, which will probably come out in the next few months. So again, we don't have a perfect recipe yet; we're looking for one, and hopefully one of the things we're working on will stick. Now, are you coding models and then training them and running them, or are you conceptualizing and turning it over to someone else? So it's mostly conceptualizing, and mostly letting the students and postdocs do the implementation, although I do a little bit of coding myself, but not enough to my taste. I wish I could do more. I have a lot of postdocs and students, and so I have to devote a sufficient amount of my time to interacting with them, sure, and then leave them some breathing room to do the work they do best. So it's an interesting question, because that question was asked of Jeff as well, right? Yeah.
And he said he was using MATLAB, and he said you have to do those things yourself, because if you give a project to a student and the student comes back saying it doesn't work, you don't know whether it's because there's a conceptual problem with the idea or whether it's just some stupid detail that wasn't done right. Yeah. And when I'm faced with that, that's when I start looking at the code and perhaps experimenting with it myself. Yeah. Or I get multiple students to collaborate on a project, so that if one makes an error, perhaps the other one will detect what it is. I love coding; I just don't do as much of it as I'd like to. Yeah. So whether it's the JEPA or the forward-forward, things have moved so quickly. You think back to when the transformers were introduced, or at least the attention mechanism, and that kind of shifted the field. It's difficult for an outsider to judge, when I hear about the JEPA: is this one of those moments where, wow, this idea is going to transform the field? Or have you been through many of these moments, where they contribute to some extent but they're not the answer that shifts the paradigm? It's hard to tell at first, but whenever I keep pursuing an idea and promoting it, it's because I have a good hunch that it's going to have a relatively big impact. And it was easy for me to do before I was as famous as I am now, because I wasn't listened to that much. So I could make some claims, and now I have to be careful what I claim, because a lot of people listen to me. Yeah. And it's the same issue with Jeff. So Jeff, for example, a few years ago was promoting this idea of capsules. Yeah. Everybody was thinking this was going to be a big thing, and a lot of people started working on it. It turns out it's very hard to make it work, and it didn't have the impact that many people thought it would have, including Jeff. It turned out to be limited by implementation issues and things like that. The underlying idea behind it is good, but, like very often, the practical side of it kills it. That was the case also with Boltzmann machines. They're conceptually super interesting; they just don't work that well, they don't scale very well, and they're very slow to train. But conceptually it's a very interesting idea that everybody should know about. So there are a lot of those ideas that are conceptual, mental objects that allow us to think differently about what we do, but they may not actually have that much practical impact. Forward-forward, we don't know yet. It could be like the wake-sleep algorithm that Jeff talked about 20 years ago or something, or it could be the new backprop. We don't know. Or the new target prop, which is interesting but not really mainstream, because it has some advantages in some situations, but it doesn't bring you, like, improved performance on some standard benchmark that people are interested in, so it doesn't have the appeal, perhaps. So it's hard to figure out. But what I can tell you is that if we figure out how to train one of those JEPA-style architectures from video, and the representations it learns are good, and the predictive model it learns is good, this is going to open the door to a new breed of AI systems. I have no doubt about that. It's exciting, the speed at which things have been moving, in particular in the last three years. About transformers and the history of transformers:
one thing I want to say about this is that we see the most visible progress, but we don't realize how much of a history there was behind it. And even the people who actually came up with some of those ideas don't realize that the ideas actually had roots in other things. So, for example, back in the 90s people were already working on things that we would now call mixtures of experts, and also multiplicative interactions, which at the time were called sigma-pi units or things like that. It's the idea that instead of having two variables that you add together with weights, you multiply them, and then you may have weights before you multiply; it doesn't matter. That idea goes back a very long time, to the 1980s. And then you had the idea of linearly combining multiple inputs with weights that are between 0 and 1, sum to 1, and are data-dependent. Now we call this attention, but this is a circuit that was used in mixture-of-experts models back in the early 90s also. So that's old. Then there were ideas of neural networks that have separate modules for computation and memory, two separate modules: one module that is a classical neural net, and the output of that module would be an address into an associative memory that is itself a different type of neural net. And those associative-memory neural nets use what we now call attention: they compute the similarity, or the dot product, between a query vector and a bunch of key vectors, then they normalize so the weights sum to one, and the output of the memory is the weighted sum of the value vectors. There was a series of papers by my colleagues in the early days of FAIR, actually, in 2014-15, one called memory networks, one called end-to-end memory networks, one called stack-augmented memory networks, another one called key-value memory networks, and then a whole bunch of things. And those use those associative memories that are basically the basic modules used inside transformers. Then attention mechanisms like this were popularized around 2015 by a paper from Yoshua Bengio's group at Mila, which demonstrated that they are extremely powerful for doing things like language translation in NLP, and that really started the craze on attention. And so you combine all those ideas and you get a transformer, which uses something called self-attention, where the input tokens are used both as queries and keys in an associative memory, very much like a memory network. And then you use this as a layer, if you want: you put several of those in a layer, and then you stack those layers, and that's what a transformer is. The filiation is not obvious, but the ideas have been around and people have been talking about them. There was similar work, also around 2015-16, from DeepMind, called the neural Turing machine or the differentiable neural computer, those ideas that you have a separate module for computation and another one for memory. And there's a paper by Sepp Hochreiter and his group also on neural nets that have a separate associative-memory type system. They are the same type of thing. I think this idea is very powerful.
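(A minimal sketch of the associative-memory attention operation described above, with made-up dimensions: compare a query against a set of keys with a dot product, normalize the similarities so they lie between 0 and 1 and sum to 1, and return the weighted sum of the value vectors.)

```python
# Minimal attention-as-associative-memory sketch: similarity, normalization, weighted sum.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    scores = keys @ query                     # dot-product similarity with every key
    weights = softmax(scores)                 # between 0 and 1, summing to 1
    return weights @ values, weights          # weighted sum of the value vectors

rng = np.random.default_rng(0)
d = 8
keys = rng.normal(size=(5, d))                # 5 memory slots
values = rng.normal(size=(5, d))
query = keys[2] + 0.05 * rng.normal(size=d)   # a query close to the third key

out, weights = attention(query, keys, values)
print(np.round(weights, 3))                   # most of the weight should land on slot 2
# In self-attention, the input tokens themselves supply the queries, keys, and values,
# which is what makes the whole layer equivariant to permutations of the tokens.
```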
The big advantage of transformers is that, in the same way convolutional nets are equivariant to shifts (if you shift the input of a convolutional net, the output shifts but otherwise doesn't change), a transformer, if you permute the input tokens, permutes the output tokens the same way, but they are otherwise unchanged. So convnets are equivariant to shifts, transformers are equivariant to permutations, and the combination of the two is great. That's why I think the combination of convolutions at the low level and transformers at the top is, for natural input data like images and video, a very good combination. Is there a combinatorial effect as the field progresses, where all of these ideas create a cascade of new ideas? Is that why the field is speeding up? It's not the only reason; there are a number of reasons. So one of the reasons is that you build on each other's ideas, et cetera, which of course is the hallmark of science in general, and also art. But there are a number of characteristics, I think, that help that to a large extent. One in particular is the fact that most research work in this area now comes with code that other people can use and build upon, right? So the habit of distributing your code in open source, I think, is an enormous contributor to the acceleration of progress. The other one is the availability of sophisticated tools like PyTorch, for example, or TensorFlow or JAX or things like that, with which researchers can build on top of each other's code base, basically, to come up with really complex concepts. And all of this is permitted by the fact that some of the main contributors to those ideas, who are from industry, don't seem to be too obsessive-compulsive about IP protection. So Meta in particular is very open. We may occasionally file patents, but we're not going to sue you for infringing them unless you sue us. Google has a similar policy. You don't see this as much from companies that tend to be a little more secretive about their research, like Apple and Amazon, although I just talked to Samy Bengio. Yeah, he's trying to implement that openness. More power to him, good luck. It's a culture change for a company like Apple, so this is not a battle I would want to fight, but if he can win it, good for him. Yeah. It's a difficult battle. Also, I think another contributor is that there are real practical commercial applications of all of this. They're not just imagined; they are real. And so that creates a market, and that increases the size of the community, and so that creates more appeal for new ideas, right, more outlets, if you want, for new ideas. Do you think that this hockey-stick curve is going to continue for a while, or do you think we'll hit a plateau? It's difficult to say. Nothing looks more like an exponential than the beginning of a sigmoid: every natural process has to saturate at some point. Yeah, the question is when. And I don't see any obvious wall being hit by research at the moment; it's quite the opposite, there seems to be an acceleration of progress, in fact. And there's no question that we need new concepts and new ideas; in fact, that's the purpose of my research at the moment, because I think there are limitations to current approaches. So this is not to say that we just need to scale up deep learning and turn the crank and we'll get to human-level intelligence. I don't believe that. I don't believe it's just a matter of making reinforcement learning more efficient; I don't think that's possible with the current way reinforcement learning is formulated. And we're not going to get there with supervised learning either. I think we definitely need new, innovative concepts. But I don't see any slowdown yet. I don't see people turning away from it and saying it's obviously not going to work, despite the screams of various critics, right?
Yeah. I'm sure about that. But I think, to some extent, at the moment they are fighting a rearguard battle, because they plant a flag and say you're never going to be able to do this, and then it turns out you can do this, so they plant the flag a little further down: now you're not going to be able to do this. So the flag keeps moving, yeah. Okay, my last question: are you still doing music? I am. And are you still building instruments? I'm building instruments, electronic wind instruments, yes. I'm in the process of designing a new one. Wow. Yeah, okay, maybe, I think I said this last time, maybe I could get some recordings and put them into the podcast or something. As I probably told you, I'm not that great a performer. I'm probably better at conceptualizing and building those instruments than playing them. But yeah, it's possible. That's it for this episode. I want to thank Yann for his time. If you want to read a transcript of today's conversation, you can find one on our website, eye-on.ai, that's E-Y-E hyphen O-N dot A-I. Feel free to drop us a line with comments or suggestions at craig@eye-on.ai, that's C-R-A-I-G at E-Y-E hyphen O-N dot A-I. And remember, the singularity may not be near, but AI is about to change your world. So pay attention.