Images and Words: AI in 2026

Generative AI, in particular LLMs and diffusion models, has pushed the capabilities of what ML can do beyond the expectations of many researchers a few years ago. How should this be extrapolated into the future? Are we close to artificial general intelligence? To an abundance of high quality movies and software written without careful human supervision? To explore these questions I have been playing with these models as well as reading recent research pointing to where they may be headed next. I set myself a somewhat arbitrary horizon of 2026 to make some predictions of what I think AI won't be able to do by then. I then speculate on what some future applications of AI might be.

In a previous Links post and in a recent tweet, I expressed my relative lack of excitement about what a lot of people are doing with what I called "the AI stuff" (narrowly, large language models and diffusion models, collectively "generative AI"; excluding e.g. Tesla's FSD or AlphaFold). In an even earlier tweet, I asked Twitter if we had learned anything new from LLMs yet, as opposed to LLMs telling us what we (the internet) already knew; the conclusion being that we have not. Nostalgebraist has been writing along the same lines for a while.

Recent AI hype contrasts with the fact that GPT3 was first released to the public in June of 2020 and since then, not that much has been done with it, with the recent exception of its descendant model Codex. I do not think this is because of insufficient tinkering with the models; rather, it has to do with intrinsic limitations of the models that are currently available and that I expect will be available in the foreseeable future.

Up until recently the most popular use case seems to have been marketing copywriting. I don't find this particularly exciting; search results on the internet are already full of bland content marketing. It's different if one does content marketing by, say, writing an article about how to use your API, or if a company does a toy project using their API and then writes about it. The hard work here is in the project or thinking about the use case in a thoughtful way, and less so in the writing itself. Admittedly I do not know much about content marketing though!

I find code generation (Copilot, Ghostwriter) more exciting. Copilot can be genuinely helpful and by some accounts it's bringing hundreds of millions of dollars to GitHub.

Then there's of course image (and video!) generation. The systems I'm aware of and that I have used are DALLE2, Stable Diffusion, and Midjourney. These are fun to play with and I predict they will be used in a number of creative tasks from generating images for posts to assets for videogames to movie styling. RunwayML but also seem to be the leaders here. A recent concrete example I saw in the wild was generating the header of this interview with Michael Nielsen by generating a plausible image completion around a smaller image.

Looking back, four years ago the state of the art in image generation was generating faces and numbers. There were no scaling laws papers and the first GPT paper had just been released. Rapid progress in some domains in the last few years has led some to think not only that progress in these domains will continue as fast or faster, but that we are on the verge of full-blown artificial general intelligence.

I don't see it that way: Progress has indeed been fast lately. In a few years I will not be writing a snarky remark about GPT-N not being used for much because by then it will be obvious that there are at least a handful of legitimately useful applications for LLMs in broad use. At the same time, in a few years I do expect some people that today are really excited will feel disillusioned: we'll get close, but we'll remain far, as happened with self-driving cars: they seemed so close many years ago!

It's easy (and cheap) to theorize about what will or won't happen by some unspecified date in the future. A much harder thing to do is to commit to specific predictions or bets of the form "By 2026, AI won't be able to do X". What should those X be? This essay is an attempt to sort through my thoughts and come up with those bets.

One lens this essay is written through is what applications of AI I would find useful. I don't do video editing, am not an artist, and have lots of knowledge and means to find it and index it. The lack of excitement may come from the fact that I want to have models helping me just like they can help others do their job. The topics I've decided to focus on here are guided (or biased) by those considerations.

A second lens is forecasting progress towards general artificial intelligence. In a previous post I discussed how AI systems could pose danger, but I did not discuss when those systems will actually be built. The present essay is an effort to think about AI capabilities timelines in the short-to-medium term. This lens led to the sections on "true understanding", compositionality, and remarks on progress in narrow-purpose models and general-purpose models.

These two lenses are important to keep in mind because otherwise this essay will seem to be harshly dismissing genuine progress in the field.

Throughout the essay I also suggest research directions and ways to improve the models. At the end of the essay, in an appendix, I suggest some startup ideas in this domain.

Will we get a Hollywood-tier blockbuster movie within the next 5 years? Video generation is still in what I'd describe as its infancy. Microsoft's NUWA, Google's Imagen Video, Phenaki, or Meta's Make-A-Video are impressive as technological achievements and without doubt we will see better models in years to come. The parameter counts in the models are not particularly high compared to, say, PaLM, and the training data Imagen uses does not include everything Google has access to (they could eventually train on a large dataset of YouTube videos).

However! One can still see some peculiar weird artifacts in most of these videos, in this particular video from Google Imagen Video or the one below that one, from Phenaki. Note how some features (like the eyes of the bear or the sails of the ship) pop in and out of existence, or wobble around:

Why this wobbling? My explanation is that the model doesn't have conceptual understanding, a claim some readers will nod to enthusiastically and others will predictably think it's either wrong or true but irrelevant, so more on this later; I of course hold a third more complex opinion. The model is trying to generate video that looks coherent in a similar way the videos in the training set look coherent, and doing so conditional on a prompt. In contrast, you could imagine a model that instead generates a 3D model of a bear and a texture (which stays constant throughout the video), then generates a set of animations, then animates the bear through the water. This is what a human would do at an animation studio and plausibly this could too be done with AI (There are models to generate 3D models out of images already, as well as models to generate animations from prompts). This is however contrary to the ethos of the modern approach to AI: Eschew purpose-specific solutions, seek end to end general solutions by training large models on large corpora of the desired kind of final output data (movies).

To be sure, this approach has been paying off well for Tesla and FSD, who started with many purpose-specific systems and ended with progressively more end-to-end systems, but this is taking longer than expected, and even today we are not at a level where a driver can stop paying attention. It doesn't matter if the model is reasonable 95% of the time: if it can do something unsafe in the remaining 5%, a driver has to remain alert.

There are not many examples of videos fully generated with AI yet, but there are many images. Many or even most of these images look satisfactorily pixel perfect. Have a look at this gallery in Lexica for some examples. But sometimes you find things like this where the arms look all off. Generating one-off images, especially when one can try multiple times, is one thing. Generating a movie is another. A two minute clip (at 24 fps) is a sequence of 2880 pictures. They all have to be accurate and coherent with the previous ones. A two hour movie at the same frame rate is 172,800 individual frames.

The Phenaki demo website does have examples of videos longer than two minutes. I can easily see how voice and a longer script could be added to this, and we'd have a passable movie, but not a blockbuster. Fine, one may say. It's not a Hollywood competitor now, but what about in 2026? Remember, we had no AI generated video just a few years ago, then we got to these kinds of videos by interpolating images, and now we have better spatiotemporal coherence with the new video models. Isn't it easy to imagine that in 2026, after all the investment the field has seen and will continue to see, they will be able to output 172,800 consecutive individual frames of pixel-perfect Hollywood-level quality? For the probability that a sequence of 172,800 randomly sampled frames is coherent throughout to be >90%, the probability that each frame is good must be over roughly 0.9999994 (treating frames as independent). It's hard to cash this out in precise benchmarks; where are current models at? And how do we account for the fact that frames are not really sampled randomly? But my intuition from looking at images generated with these models says they are far from there; and likewise my intuition is that going from mediocre to good is substantially easier than going from good to great. One could allow for a handful of less than perfect frames, and for multiple shots at the task, and get away with less reliable models.
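The per-frame reliability argument can be made concrete with a few lines of arithmetic. The sketch below is mine, not taken from any of the models discussed; the function names `frame_count` and `required_per_frame_probability` are hypothetical, and it assumes frames succeed or fail independently, which real video frames do not.

```python
import math


def frame_count(duration_seconds: float, fps: int = 24) -> int:
    """Number of frames in a clip of the given length at the given frame rate."""
    return round(duration_seconds * fps)


def required_per_frame_probability(n_frames: int, target: float = 0.9) -> float:
    """Smallest per-frame success probability p with p ** n_frames >= target.

    Assumes every frame is an independent trial (a simplification).
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    # p ** n = target  =>  p = exp(ln(target) / n)
    return math.exp(math.log(target) / n_frames)


# The two minute clip from the text: 2 * 60 seconds at 24 fps.
clip_frames = frame_count(2 * 60)
print(clip_frames, required_per_frame_probability(clip_frames))
```

The same function can be called with the frame count of a longer film; the required per-frame probability approaches 1 quickly as the number of frames grows.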

A given kind of person sees models from five years ago with just faces, models now, and draws a line from A to B and extrapolates that they will keep improving. If you are me you see improvements in some dimensions but not others (Obsessing over subcomponents of narratives is a Nintil house move), and so you predict continued improvements in some dimensions, and less so in the others that haven't improved or have improved at a slower rate.

What will improve? Quality of textures for sure; Midjourney V4, released as I am writing this essay, is already one leap in this direction. Spatiotemporal coherence will improve a bit. Both together will improve at a slower pace.

I expect producing an anime movie is easier because textures are simpler but it will still be challenging. See this example clip. Textures are simpler, but there's still keeping track of where the characters are in the scene, where the camera is looking, when to do closeups or wider shots, and syncing voice, music, and video.

I just seem to have a strong prior that without a better understanding of objects as entities over space-time (as opposed to something closer to just pixels on a screen) we are not going to get there by 2026 (to fully automate movie generation).

Image AI systems available today seem to struggle with compositionality: that is, being able to lay out concrete objects in defined spatial relationships according to a prompt. This breaks down in scenes where there are varied objects that have to have specific spatial relations. One representative example of this sort of problem, with a prompt I just made up, is here (DALLE2):

None of these pictures meets the requirements of the prompt (One has a red speaker, they all lack the espresso machine, three don't have the computer monitor, one doesn't have a laptop etc). These mistakes are very trivial to see for a human being that has been asked to produce the image from the prompt.

Stable Diffusion v1.5 did better than DALLE, and below Midjourney V4 did better than SD, but still none of the images captures what I wanted.

Here are some other fun examples:

The issue here is not whether I am a good prompt engineer (It's my first try at this particular scene). I am sure one can get better results by playing with the model. The issue is not one of comparing relative performance at generating some output (These models can produce outputs that I, without extensive training, cannot produce).

The issue is that the models are far from really understanding what they are outputting to the same level a human would. This happens (maybe?) because AI models for vision seem to like to think in terms of textures (not that they only do this, but they do it more than we do). If a given area has enough of the right texture (or color) then the image looks okay to the model. This explains why there are a few green bits in the images that I didn't ask for. We might be overrating what these models actually do because often we see the (really good) end-products of prompt engineering and example picking (unless one has spent sufficient time playing with the models).

Ok but if one had asked me, five years ago, how likely is it that you'll see the images I just posted, what would I have said (after doing a brief literature review)? Five years ago (2017) there was work on generating small synthetic images that looked okay. These faces are probably the most detailed images we had back then. In 2015 there was at least one paper that showed it was possible to generate very small and blurry images of scenes with multiple objects from prompts. Since then we have seen better textures, a greater variety of objects being depicted, and higher resolution. I'm naturally drawn to assume that these will keep getting better. But there has also been little progress in getting the images to correspond robustly and repeatably to what models are being asked to produce. This, I then think, is also tied to the blurry artifacts and weird images we see occasionally. Larger models with the same architecture will struggle in this exact way.

Could this be fixed in the training set? There are not as many scenes with multiple objects in the training sets, whereas there are many depictions of individual objects. In principle one could generate more scenes if we have the underlying objects. Perhaps we can ask the model to generate a tea cup and an espresso machine, then copy them side by side, do some painting over with another model call, and use that resulting image as a scene, then build a corpus of such scenes to improve the model's understanding of compositionality. Doing this seems well within what's currently possible.
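To illustrate the bookkeeping side of that idea, here is a minimal sketch that only produces scene specifications: a caption plus where each object should be pasted. Everything in it is hypothetical (the `Placement` class, the `RELATIONS` table, the `make_scene` function, the 256-pixel grid cell); generating the object images, pasting them, and painting over the seams would be separate model calls that are not shown.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Placement:
    name: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels (image coordinates: y grows downward)


# Spatial relation -> offset of the first object relative to the second, in grid cells.
RELATIONS = {
    "to the left of": (-1, 0),
    "to the right of": (1, 0),
    "above": (0, -1),
    "below": (0, 1),
}


def make_scene(objects: List[str], rng: random.Random, cell: int = 256) -> Tuple[str, List[Placement]]:
    """Pick two distinct objects and a relation; return (caption, [placed, anchor])."""
    first, second = rng.sample(objects, 2)
    relation, (dx, dy) = rng.choice(list(RELATIONS.items()))
    anchor = Placement(second, cell, cell)  # the reference object sits in the middle cell
    placed = Placement(first, cell + dx * cell, cell + dy * cell)
    return f"the {first} {relation} the {second}", [placed, anchor]


rng = random.Random(0)
caption, layout = make_scene(["tea cup", "espresso machine", "laptop"], rng)
print(caption, layout)
```

Because the caption and the layout come from the same sampled relation, each synthetic scene is labeled correctly by construction, which is the property the existing training sets lack.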

There is precedent for a big compositionality problem being solved: text. If you search for images with text in Lexica you'll find that each letter makes sense but the whole looks like mangled garbage. Not so with Google Imagen, which is able to reproduce, at least sometimes, well-ordered text; the Imagen paper also introduces a benchmark for compositionality (DrawBench). Figure A.18 shows one clear example of a prompt of the kind that DALLE struggles with but Imagen is able to successfully depict. Contrary to "scale is all you need" memes, this advance was the result of a careful investigation of prompt-guidance and a novel thresholding technique (Section 2.3). DrawBench includes some examples drawn from this paper from Gary Marcus et al. (2022) where clearly DALLE2 struggles, but those are not shown in the examples showcased in the Imagen paper, nor is Imagen publicly available yet, so I was not able to poke at the model for this essay. Because of the Imagen results I predict that image generation with well-ordered text will be a solved problem by 2026.

Another recent paper (Liu et al., 2022 "Composable Diffusion") gives up on having a single model generate entire images and breaks down queries into submodels which can then be aggregated back again, while at the same time they include training sets that specifically include object relations as I suggested above. One of the example prompts they have is “A green tree swaying in the wind” AND “A red brick house located behind a tree” AND “A healthy lawn in front of the house”. Stable Diffusion does not struggle with this one at all, perhaps because lawns with trees and houses behind are not unusual. In contrast, an artificial prompt like “A large blue metal cube to the left of a small yellow metal sphere” AND “A large blue metal cube in front of a large cyan metal cylinder” leads to this sort of thing, which the Liu paper does better at. Even after seeing this result, I am still convinced that training sets with more complex scenes will be required for models to get composition in a robust way.

In the appendix I took a number of essays/blogposts/articles, copied a chunk of them to GPT3 and got continuations, trying about 3 times with each and picking what I thought was the best. I then compared that with the original text. In all cases, I would have preferred to read the original text.

Note that this is different from whether the GPT3-generated text is humanlike (which it often is) and correct (which it also roughly is). But when we read something we usually don't want to be told what we already know: we want to be exposed to novelty and GPT3 does a mediocre job at this.

You can head to the appendix, read the text there and then come back to this section.

A summary of my analysis of this brief blogpost completion exercise is that the GPT3 completions have less detail than the interesting blogposts. In a way, this is reminiscent of early image generation models where the textures looked very soft (newer models, particularly Midjourney V4, are now capable of close to photorealism). The system Elicit uses (Primer?), which breaks down questions manually and allows for search, produces better results. For example, for the prompt "Does rapamycin extend lifespan in mice?" we get this from GPT3:

Whereas Elicit would say something much better (and with working links):

I myself would have written something like:

My answer has less detail than Elicit (Elicit cites more papers) but arguably mine is better because it gives the right level of detail: The ITP paper is strong enough to base most of the answer on it, and my answer does not confuse someone that is not aware of there being different strains of mice, how translatable research is, or how much credibility to assign to the ITP vs other efforts to measure lifespan. My answer also has the benefit of coming from someone that has written a Longevity FAQ and Nintil in general: If you trust my rigor in general you may also trust me in this particular case.

GPT3 (and future models) face a problem similar to the tools for thought domain (thread). For a newbie in a given domain, GPT3 has knowledge they don't have; but they may not know what questions to ask or what the answer even means in its full richness, or whether to even trust it, and their needs may be better served by simpler approaches like a Google search. For the expert, they already know the domain, so while they can judge GPT3's answers they have no use for them because they already know them.

This will change in the near future: Just today (2022-11-08) a paper from Anthropic came out trying to use an LLM-powered assistant to help newbies with a benchmark task. Plausibly an extrapolation of this ("raising the floor") is models fine-tuned by domain experts and deemed accurate by them, so then newcomers can trust them. We could have models fine-tuned with the help of doctors that are really good at telling patients what their symptoms mean. On the other hand, the generation of novelty ("raising the ceiling") seems harder by virtue of how these models are trained (to predict the next most likely token).

How might this latter, raising the ceiling, be accomplished? Obviously we don't want unhinged text, or models trying to predict the least likely token; we want something that maximizes some "interestingness" metric as opposed to a "likelihood" metric. Likelihood as usually used in ML training is a property of a token given a dataset, whereas interestingness is more of a social construct that depends on who is reading the text and when the text was written. Talking about, say, general relativity if asked about weird anomalies in the orbit of Mercury is not as interesting now as it was before Einstein was born. So one way to perhaps get models to learn interestingness is to take examples of texts considered interesting at the time, and find a way to finetune models to produce them when fed text produced temporally prior to the interesting text. This is at present very hard to do given that all this text wouldn't fit in the context window, and this sort of training would require having reliable timestamps for the training datasets.
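A minimal sketch of the data preparation this would need, assuming a corpus where every document carries a reliable publication date and someone has already flagged which documents count as interesting. The `Document` class, the `interesting` flag, and `build_training_pairs` are all hypothetical names, and the cap on context documents is a crude stand-in for the context window limit.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Document:
    published: date
    text: str
    interesting: bool = False  # assumed to be labeled by some external process


def build_training_pairs(corpus: List[Document], max_context_docs: int = 3) -> List[Tuple[List[str], str]]:
    """For each interesting document, pair it with the most recent strictly older texts."""
    if max_context_docs < 1:
        raise ValueError("max_context_docs must be at least 1")
    ordered = sorted(corpus, key=lambda doc: doc.published)
    pairs = []
    for target in ordered:
        if not target.interesting:
            continue
        # Only material published before the target may appear in its context.
        prior = [doc.text for doc in ordered if doc.published < target.published]
        context = prior[-max_context_docs:]
        if context:
            pairs.append((context, target.text))
    return pairs
```

The strict date comparison is the important part: anything published on or after the target would leak the answer into the context.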

Ultimately there's a function that goes from knowledge today to knowledge tomorrow and we wish to learn that function. Right now the most promising approach to get more interestingness seems to be to train the model on datasets in which blogposts and scientific papers are better represented, and then finetune based on prompts.

I look forward to the time when models can produce special relativity or CRISPR when trained on datasets that do not include mention of those! In the meantime, we might get individual assistants that constantly learn from each individual's preferences and knowledge.

Common sense reasoning used to be one of the holy grails of AI, perhaps after this John McCarthy paper from 1959. One day we woke up and a solution unexpectedly popped into existence: Large Language Models. Or so it seems. If one doesn't buy this yet then one can imagine slightly better systems that will surely come in the future. But after having played a lot with GPT3, I can say that for many questions, if that knowledge is public information on the internet, GPT3 can answer them relatively competently. Some examples below, including some examples that many humans would not know the answer to!

One can construct prompts where GPT3 fails if one tries hard enough to find edge cases. It may be possible that sometimes GPT3 gets these right, or that future models will get these right, or that GPT3 itself given a few examples would get these right as well. The point is not so much how capable GPT3 is or isn't but rather that despite being able to give correct answers in the cases earlier, it still does so without understanding everything to the same level a human would. By extrapolation, future systems will be more impressive, but might still feature silly bugs like the ones below.

The letter count task and grid task at the end can be found in this essay's companion GitHub repo. I tested those in zero, one, and two shot settings and could not get good performance out of GPT3.

More generally, there are benchmarks that test the capabilities of ML models. These benchmarks include questions like the ones above; one of them (the one about Ian and Steven) comes from one such benchmark (Winogrande). ML models have been blowing past these benchmarks faster and faster, to the point that in a forecasting exercise, the performance level for a particularly hard benchmark (MATH) that wasn't supposed to be achieved until 2025 was in fact achieved in mid-2022 already. These benchmarks tend to consist of questions like the ones I generated above, each trying to isolate a handful of variables and involving a handful of entities to reason about at a time.

Be that as it may, this improvement in benchmarks has not yet translated into real world deployment. We can speculate why, and that speculation probably will point us to the work that's left in the road towards more generally intelligent agents.

First, models armed with just common sense reasoning are not that useful to humans in most contexts as most people will do that reasoning by default unaided (that's why it's common sense!). But a given human does not know all publicly available facts; LLMs kind of do. However, a human will reach out for Google, so human+Google search is the standard to beat. Using LLMs as better search engines is an active area of research and development, with companies like Metaphor trying to build search engines powered by LLMs.

Second, there are many contexts where LLMs can do an ok job but where we have better purpose-specific systems. Indeed one could ask GPT3 to count the letters in "blap1234", but if doing so is useful and is going to be done lots of times, the time spent in writing a Python function to do that is trivial and worth doing: in those cases people will use purpose-specific systems instead of calls to LLMs. Hence, human+Google search+small programs is a further standard to beat.
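For a sense of how small such a purpose-specific program is, here is one possible version. It is a sketch of my own rather than anything a model produced, and it assumes "letters" means alphabetic characters (so digits are ignored); the name `count_letters` is made up for illustration.

```python
def count_letters(text: str) -> int:
    """Count the alphabetic characters in text, ignoring digits and punctuation."""
    return sum(1 for ch in text if ch.isalpha())


print(count_letters("blap1234"))  # prints 4
```

Unlike a call to an LLM, this gives the same answer every time and costs effectively nothing to run.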

GPT3 might have gotten this question wrong earlier, but the same question, when posed to the coding-specific Codex, gets us the right answer (it produces a more reliable purpose-specific algorithm which we can then run).

Codex (and Copilot) are relatively recent additions that as I remarked in the introduction are the first interesting useful application of LLMs that has seen some level of mass adoption.

To further drive the point home: Yes, GPT3 can be given some CSV-formatted data and then it can answer questions about it and sometimes it'll get it right! But if you were doing this in a serious context where getting correct answers matters, or if the datasets are large, you would rather use a parser specifically for that. Even when setting costs aside, would you rather write a data pipeline or a prompt for an LLM? I'll keep the pipelines... but Codex can now help write them.
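As an illustration of the parser route, the sketch below answers one question about CSV data with Python's standard `csv` module. The helper name `column_mean` and the sample data are made up for this example.

```python
import csv
import io


def column_mean(csv_text: str, column: str) -> float:
    """Parse CSV text with a header row and return the mean of one numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in reader]
    if not values:
        raise ValueError("no data rows found")
    return sum(values) / len(values)


sample = "name,age\nAda,36\nAlan,41\n"
print(column_mean(sample, "age"))  # prints 38.5
```

A malformed number or a missing column raises an exception here instead of producing a plausible-looking wrong answer, which is what one wants when correctness matters.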

Third, there are many domains that have so far remained beyond the reach of LLMs because there isn't enough of the right kind of data to train the models on, say models for Computer-Aided Design (CAD), or for prediction of the proteome from transcriptome and epigenetic state. More broadly, this includes complex planning tasks where the context has to be selected from many possible available facts, a key step towards general intelligence. Right now I'm aware of what's in front of me, of what I'm typing, of my physical location, calendar appointments, ongoing conversations, etc. It is one thing to handpick the relevant facts and feed them to a model, and another to list every possible fact that you could possibly be aware of now and narrow it down to your task, and then do the task. We can't experiment with this yet because models' context windows are not long enough yet.

Code generation these days is quite impressive. Here is a more complex prompt than the simple function from the previous section:

And here is the code produced, verbatim from Codex. I copypasted it and ran it, checking that it indeed runs as one would expect. The primality algorithm could be improved and deserialization could be done in a safer way but it's a start.

Ok fine one might say. But writing request handlers and checking whether a number is prime has been done to death; what is a more interesting, unfairly complicated program we can write? I asked Codex various times to "Write the backend of an app that acts as a clone of Twitter" and while I got a bunch of valid-looking code, I did not get something I could just run to Y Combinator with the next day.

Another useful piece of software to have is a clone of Turbotax. Codex can't produce Turbotax yet based on this prompt:

My total income this year was X dollars. Write a program that calculate my taxes according to the latest published IRS regulations. I contributed 1000 dollars to my Traditional IRA. I also bought and sold stocks and had both W2 and 1099 income.

Being more reasonable, I tried a couple of times the prompt "Read all pdf files in the directory called frames and return a list of the ones that have text containing the string "Hiring". To avoid issues when parsing PDFs, use OCR to extract text. Set a high PIL MAX_IMAGE_PIXELS." This indeed leads to a program that converts PDFs to images and then searches for the string I mentioned. Neat! However, on the first trial the program crashed because PIL, the library being used by the generated code, gives up if the file is too large, so I had to manually add the last part to the prompt. Still, neat! Though you could imagine how this program could work, Codex adds value in that it can write it faster than you can, and then you can add the finishing touches if needed. I was also able to get good results eventually with "Extract 100 random frames from an input video, send them to the AWS Recognition API to check for illicit content."
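For readers who want to picture such a program, here is a sketch of what it could look like. This is my own reconstruction and not the Codex output. It assumes the third-party packages Pillow, pdf2image, and pytesseract are installed (along with the poppler and tesseract binaries they rely on); the function names and the specific pixel limit are arbitrary choices for illustration.

```python
from pathlib import Path
from typing import Callable, List


def ocr_pdf_text(pdf_path: Path) -> str:
    """Render each page of a PDF to an image and OCR it (needs third-party packages)."""
    # Imported here so the search helper below works without these packages installed.
    from PIL import Image
    from pdf2image import convert_from_path
    import pytesseract

    # Raise Pillow's decompression-bomb limit so large pages are not rejected.
    Image.MAX_IMAGE_PIXELS = 1_000_000_000
    pages = convert_from_path(str(pdf_path))
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def find_pdfs_containing(
    directory: str,
    needle: str,
    extract_text: Callable[[Path], str] = ocr_pdf_text,
) -> List[str]:
    """Return the names of the PDFs in directory whose extracted text contains needle."""
    matches = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        if needle in extract_text(pdf_path):
            matches.append(pdf_path.name)
    return matches
```

Calling `find_pdfs_containing("frames", "Hiring")` would correspond to the prompt above. Passing the text extractor as a parameter is a design choice of this sketch: it makes it easy to swap OCR for a regular PDF text parser, or for a stub when testing.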

Now for something that's not as easy: "Download all scientific papers from the internet and OCR them. Store them locally." The second part of this is the code that we got from earlier; the first part is left to us to define. Codex on its own does nothing useful with this prompt. But also most people wouldn't know where to start: this is far from common sense. Perhaps going to Google Scholar, doing random searches, and scraping all PDFs linked from there? But many papers do not have publicly available PDFs, plus this random search approach could take forever. One could then get the Semantic Scholar dataset, which I happen to know is fairly comprehensive, get the DOIs from there, then plug those into Sci-hub, and get the PDFs from there. We might be able to plug this into GPT3 to ideate, and then copy the GPT3 ideas into Codex, but I wasn't able to get much of use doing this, and frequently I got nonsense.

I've seen some people plugging prompts into Google search, getting items from there, and iterating back and forth with the LLM and search. For example, one might do the following:

This is something Codex cannot do today. It is a particularly hard case because it involves multiple iterations and one human interaction with an external party. Arguably precisely these are the hard parts of software engineering, when one doesn't know exactly how to do something, when someone else needs to give you access to a resource, or where unexpected errors have to be debugged.

At first, Codex seems to be able to produce impressive pieces of code spanning multiple files and languages. A single prompt was able to generate a simple Tetris game using HTML, CSS, and JavaScript (results in this repl). The code is almost correct (e.g. there is a bug when rotating the Tetrominoes) but still, it's a mostly functioning ~450 LOC program produced from a short prompt! One might object here that Tetris has been around for decades and there are multiple implementations that are open source. Codex might just be looking up those and copying them. Sure enough I was able to find snippets here and there on GitHub that resembled the Codex-generated code but no evidence of explicit copy-paste.

Trying to think of a task that a developer that can code Tetris could do but a model that doesn't understand coding as well cannot do, I decided to give Codex a longer assignment, to implement a single player interface to the game of Set. Set needs to be explained first, so I wrote an explanation of a slightly simplified version of Set below. A reference implementation of what the output could look like is this one.

Codex did generate HTML, CSS, and JS, and I tried around 6 times to generate various versions of the code, but I could not get it to produce anything playable. Often, the code quality it produced was really bad.

As it happens, there are not many implementations of Set on GitHub. One can find some, which have better code quality than what Codex gives you. I tried the text in that repo's README as a prompt to see if that got us somewhere, but it turned out to be worse than my own prompt.

So what do we conclude from these experiments? Codex does not know how to program to the level of proficiency of a junior software engineer, despite being able to produce some output that, if produced by a human, would lead you to think that it is in fact capable of more. Codex does well when there are lots of publicly available examples of the kinds of code it is being asked to write. It is capable of more than parroting back code; one can grant it some ability to understand what code does, because it does translation between languages reasonably well, and is able to compose programs that do different logical operations into a unified whole ("extract random frames+check using an API", for example). But it seems to struggle with prompts that are not as specific. The human thing to do would be to ask for help or clarification. One way to overcome this would be to provide function signatures and ask it to fill them in. In statically typed languages, the types could constrain the model enough to provide a reasonable answer, especially if the model is allowed to read the results of the type-checker and feed that back into the program.
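The feedback loop suggested at the end of that paragraph could be wired up roughly as follows. This is a sketch under stated assumptions: `generate` stands for a hypothetical call into a code model that accepts the previous round's error messages, `check` for any checker that returns a list of error strings, and `syntax_check` uses Python's built-in `compile` as a weak stand-in for a real type-checker, catching only syntax errors.

```python
from typing import Callable, List, Optional


def generate_with_feedback(
    generate: Callable[[List[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for code, run the checker, and feed its errors into the next attempt."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(feedback)
        errors = check(candidate)
        if not errors:
            return candidate  # the checker is satisfied
        feedback = errors  # the next attempt sees what went wrong
    return None  # gave up after max_rounds attempts


def syntax_check(source: str) -> List[str]:
    """Stand-in checker: reports Python syntax errors only, not type errors."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []
```

In a statically typed setting, `check` would instead run the compiler or type-checker on the candidate and return its diagnostics; the loop itself would stay the same.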

Tinkering with Codex, if anything, reveals that a lot of currently practiced software engineering is in fact plumbing and recombining snippets of code that others have created previously. Even if Codex cannot by itself build the Airbnb website, these models can eventually relieve developers from the most mundane tasks, freeing them to think about the challenging and creative parts of software engineering.

It's a common point of debate to argue about whether ML models truly understand their output. Arguing over what understanding means is harder than agreeing whether a model passes an easier to define Turing-style test: If we take human-level as the level of an understander, then clearly current models are far from being able to understand everything as well as we do. The examples above from image generation models show outputs that a human would not make unless say drunk or trying to deceive you.

A system that is able to understand a domain learns the domain in a way that looks different from the way a system that doesn't understand the domain does. Take addition and subtraction of natural numbers, for a simple case. This domain involves awareness of what the natural numbers are and how they are ordered, and being aware of the abstract notions of adding and subtracting numbers. Importantly, someone who understands how to sum knows when they are not able to perform the operation (say, if the numbers are too big). GPT-type models generally will try regardless and be wrong a lot of the time.

When I learned these operations, I learned an algorithm to do sums and subtractions by hand, along with the broad idea of what those things mean, and learned how one could use a calculator to sum numbers. I also noticed that sometimes one makes mistakes when summing large numbers if one is not paying enough attention, in which case one wants to use a calculator.
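
For concreteness, the by-hand algorithm looks roughly like this in code. It is a minimal sketch (the function name `add_by_hand` is just illustrative) that works on digit strings so that nothing in it depends on how long the numbers are:

```python
def add_by_hand(a: str, b: str) -> str:
    """Add two non-negative decimal integers given as digit strings."""
    for s in (a, b):
        if not (s.isascii() and s.isdigit()):
            raise ValueError("inputs must be non-empty strings of decimal digits")
    # Pad to the same width so the columns line up.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    # Walk the columns right to left, carrying as we go.
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits)).lstrip("0") or "0"
```

The same dozen lines handle three digits or three hundred, which is the property the rest of this section is about.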

For a system that understands these operations, the performance in them should not be altered by the length of the number. Numbers are numbers and the algorithm followed should be the same. And yet we don't see this for GPT3: it gets two and three digit operations and then utterly fails with one more digit. The likely explanation is precisely that it doesn't really understand what it's doing:

Minerva, which was trained as a purpose-specific system for a narrower set of tasks, including arithmetic, does better than GPT3, but also presents the same problem, especially for multiplication.

A model that really understood addition (or subtraction, or multiplication) should present performance that is the same regardless of how big the numbers are: the curves should be flat at 100%, especially if we count only the examples where the model outputs an answer, as opposed to those where the model admits to "be too unsure". Ideally, the model would either return the correct answer or recognize the problem and call a python script to compute the right answer. Sure enough, one could train models for this specific task to do what I just suggested and that would pass the test. Then, if the model fails to do the right thing in other contexts where the "+" sign appears and where a human would always know it means addition, we can conclude it did not understand addition after all.

Why make this point? If we really deeply care about e.g. addition, can't we just finetune and scale models to solve the kinds of tasks they currently fail at? Minerva, after all, shows markedly better performance than GPT3. If one focuses on small enough digits, the models seem to work fine; the curves can be bent up as much as we want!

In the limit, yes, if we had infinite data and parameters. In practice, data and compute are finite. The point I am making here is that these models are not doing these operations like we do, and because of that they struggle to generalize them when exposed to unusual kinds of questions that were not in their training set like big numbers or like the 'unfair' questions in the common sense section earlier.

Here one could say that perhaps they don't fully get arithmetic because there's not that much of it in the training set. Yes, that's part of the why: With more of it you get better performance. But a human being doesn't need a million examples of sums to learn to generalize the concept of sum over arbitrarily large numbers. To be sure, there are systems that can actually get this robust sense of understanding if they have access to coding tools. DreamCoder (Ellis et al., 2020) or the famously impractical AIXI work by trying to generate the simplest programs possible that can produce the inputs seen so far. This is a step up in robustness from what neural networks do, and one could imagine enhancing transformers with these sorts of symbolic approaches in the future. Or perhaps it'll all be transformers! It wouldn't be the first time a field rejects a paradigm and then returns to it.

How do transformer models actually do arithmetic? One could do a circuits-style examination of this and study why exactly the models break with bigger numbers, but I have not seen any. It sounds fun to take a small LLM and train it purely to learn how to sum progressively larger numbers, then observe what it's learning.

The BigBench collection of benchmarks includes some problems that are like the ones I have in mind, such as this logic grid puzzle, where models do not seem to get better across four orders of magnitude in parameter increase, doing as well as choosing randomly between the options given. That is, with the exception of PaLM when given at least one example, which is slightly better than the average human rater in their set, but still markedly worse than the best rater. In a different task using emojis, Pig Latin (which are less common on the internet), models still struggle. Models do get many common sense reasoning tasks right, but one can always find common sense BigBench tasks where they still struggle. The building blocks of reasoning required for the tasks models fail at seem to be there: they are applied individually in other tasks, but somehow the models fail to realize that they have them and that they can be combined to solve the task.

In summary, models seem to be interacting with the world with their symbolic arm tied behind their back. We don't have that limitation.

Some research ideas for benchmarks where we could test whether a model 'really' understands a concept: They all start from the same premise, that given a series of concepts linked by relatively simple mechanical rules, performance on a task involving the concepts should not depend on the number of these entities. So the tests have to be arbitrarily scalable so we can evaluate performance across numbers of entities. These have transformers in mind; of course there are systems that do these flawlessly. On this test, a python REPL understands addition in a way GPT-3 does not! This is okay.

What we would do with these is to look at the shape of the (number of entities/complexity, performance) graph and see whether it's flat. If it is, then we can say the model has correctly learned the underlying concepts.
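
A minimal sketch of such a test for addition, assuming a `solver` callable that wraps whatever system is being evaluated. The two toy solvers (`exact` and `memorizer`) are hypothetical stand-ins, not real models: one behaves like the REPL, the other only "knows" numbers of up to three digits.

```python
import random
from typing import Callable, Dict, Iterable


def accuracy_by_length(
    solver: Callable[[int, int], int],
    lengths: Iterable[int] = range(1, 11),
    trials: int = 200,
    seed: int = 0,
) -> Dict[int, float]:
    """Accuracy of `solver` on random n-digit additions, for each n in `lengths`."""
    rng = random.Random(seed)
    results = {}
    for n in lengths:
        low, high = 10 ** (n - 1), 10 ** n - 1  # all numbers with exactly n digits
        correct = 0
        for _ in range(trials):
            a, b = rng.randint(low, high), rng.randint(low, high)
            if solver(a, b) == a + b:
                correct += 1
        results[n] = correct / trials
    return results


def is_flat(curve: Dict[int, float], tolerance: float = 0.01) -> bool:
    """True if accuracy stays within `tolerance` of 100% at every length."""
    return all(acc >= 1.0 - tolerance for acc in curve.values())


def exact(a: int, b: int) -> int:
    """Toy solver that behaves like a Python REPL."""
    return a + b


def memorizer(a: int, b: int) -> int:
    """Toy solver that only 'knows' sums of numbers with at most three digits."""
    return a + b if max(a, b) < 1000 else 0
```

Under this check `exact` produces a flat curve and `memorizer` does not, even though both look identical if one only tests small numbers. The same harness would extend to other scalable tasks by swapping the problem generator.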

Can we have a similar benchmark for understanding the concept of 'dog'? Arguably LLMs understand dogs better than they understand logic, so I expect they'll do well at these. But the same strategy to generate examples doesn't obviously work here. For a concept like that commonly used reasoning benchmarks with questions like 'Do dogs have four legs?' seem enough.

One general heuristic that I do believe in is that purpose specific systems beat general systems, given the same amount of compute and resources. GPT3 can play chess, but AlphaZero is better, GPT3 can steer a car if given a description of the scene, but Tesla's FSD will do better, GPT3 can write code, but Codex does better and so forth. I do not know what Google Translate uses, but I bet it's not PaLM, it's probably a custom built system specifically for translation.

Why make this point? Because it shows that while we are seeing progressively more powerful AI systems in the wild, these are not necessarily indicative of progress towards general intelligence. I do not make this point just because there happens to be purpose-specific systems that perform better than general purpose systems. I make this point because

"Purpose-specific" is doing perhaps too much work. In a way "answering common-sense questions as if you had all public written knowledge" is "purpose-specific" but it is also a very broad category, even if it cannot drive a car as well as FSD can. But common sense is not all there is to intelligence.

There are sequence prediction tasks that are not commonsensical: Parsing a genome and pointing to deleterious mutations by how surprising they are to the model. GPT3 can't do this and GPT4 won't do this. Same for parsing SMILES into chemical structures, but you can imagine a model built just for that task that does it reasonably well.

By itself, this is not an issue. In a previous essay I argued that we can safely accept as a premise that humans are not generally intelligent agents, and that human-level intelligence does not require a system to be able to do any arbitrary task. If we could engineer a system that can produce and control other systems and then that aggregate can do what humans can do, that would suffice. Recall the problem from earlier, asking GPT3 to count numbers in a series of words and failing to do so. In theory, we could ask the model to hand off the task to Codex if it detects a problem that is better handled in code. In practice (from experimenting a bit with this) this is as of today extremely finicky.
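
The hand-off can be sketched as a tiny router. This is a sketch under loud assumptions: `language_model` is a hypothetical callable standing in for GPT3, and the regular expression is a crude stand-in for the detection step, which is exactly the part that is finicky in practice.

```python
import re
from typing import Callable

# Matches simple "a + b", "a - b" or "a * b" expressions on non-negative integers.
_ARITHMETIC = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)")
_OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def answer(question: str, language_model: Callable[[str], str]) -> str:
    """Compute arithmetic in code when it is detected; otherwise defer to the model."""
    match = _ARITHMETIC.search(question)
    if match:
        left, op, right = match.groups()
        return str(_OPERATIONS[op](int(left), int(right)))
    return language_model(question)
```

For example, `answer("What is 123456789 + 987654321?", model)` never reaches the model, whereas a question with no recognizable arithmetic is passed through unchanged.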

Future advances in interpretability may also lead to a reduced reliance on general models day-to-day: LLMs do well at tasks like recognizing and extracting entities from a text, it may be a matter of time until it's possible to extract the relevant circuits to do exactly that and package them into smaller, more efficient (and accurate) models. Rather than having general models deployed everywhere, we may end up with large models that are then strip-mined and repurposed for the specific task at hand.

Consider what Adept is building. Instead of constantly scraping the internet and asking a model to produce answers, they are training a model to use a browser like humans would. Not much is known about this approach, pioneered by WebGPT, and how it scales to other domains. Does the Adept approach work for CAD or film production? In principle it could if given enough demonstrations. If someone makes a Hollywood-level movie purely from a prompt without intervening until the result is ready, I expect a hybrid model will get there before end-to-end models do. Replit recently introduced Ghostwriter, a system similar to GitHub's Copilot but that could become more powerful due to the fact that Replit is a fully integrated development environment: Replit knows what users type, what errors they get, what they run in the REPL. Replit is in a unique position here for now, but they may be limited by the kind of software that usually gets built on Replit. It would be interesting to see what happens if one instruments computers at Lockheed Martin while mechanical engineers are designing parts, and then uses that data for the CAD problem. This sounds farfetched, but RunwayML is in the same position Replit is but for video, so if one had to bet on someone making progress in automating end-to-end movie generation it would be them (and big tech companies).

I have different intuitions about systems that can generalize a lot and systems that are doggedly trained in narrow domains. What if we can get 90% of the way there with narrow AI? It may be more useful to have end-to-end agentic systems but even if this proves as hard as I think it will be, I am more optimistic about cobbling together these narrow systems with some human interaction between steps, in a way perhaps reminiscent of Drexler's imagined future of AI here.

I've been playing with GPT3 and image models a lot to see what they do and do not do. Most questions (or image prompts) I've posed to these models get a reasonable answer; at this point one has to actively try to mess around with the models to get them to say stupid things reliably. Models are also getting better at saying "I don't know" rather than hallucinating answers.

That said, the answers one gets from GPT3 have some bland quality to them. They can be helpful indeed but not mindblowing.

In the case of writing code, even if we haven't yet distilled Jeff Dean into an LLM, a nontrivial chunk of software engineering is looking up libraries and function calls on Stack Overflow and pattern-matching to the right snippets, and this is something LLMs can do today already, and we have only gotten started.

A generic heuristic I came up with is that AI will continue to struggle with tasks that humans can't do or plan how to do immediately. As an example, if you ask me to produce the SQL to select a column, filter by another, and compute an aggregate, where this involves joining two tables together, I can give you this answer without really thinking about it that much (I have written a lot of SQL!). GPT3 in fact will give you this:
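
A query of that shape, given here as a hedged illustration rather than as GPT3's actual output, might look like the following. The `customers` and `orders` tables, their columns, and the sample rows are all hypothetical; the query selects a column, filters by another, joins the two tables, and computes an aggregate.

```python
import sqlite3

# Select a column (country), filter by another (status), aggregate (SUM),
# joining two hypothetical tables on the customer id.
QUERY = """
SELECT c.country, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
WHERE o.status = 'paid'
GROUP BY c.country
ORDER BY c.country;
"""


def run_example():
    """Build a throwaway in-memory database and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, status TEXT
        );
        INSERT INTO customers VALUES (1, 'ES'), (2, 'US');
        INSERT INTO orders VALUES
            (1, 1, 10.0, 'paid'), (2, 1, 5.0, 'refunded'),
            (3, 2, 7.5, 'paid'), (4, 2, 2.5, 'paid');
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```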

If you asked me to write a piece of software to solve this Advent of Code problem I would not be able to tell you as readily what the solution is; the code doesn't pop in my head fully formed, there's some thinking one has to do first (my solution). Advent of Code might be a fair benchmark for code-generation models: each question in AoC is self-contained and often not trivial, while at the same time being far from the complexity of writing a 10k LOC program.

Another heuristic is that models, by their nature, will continue to be deficient in "true understanding" in the sense defined earlier. In the case of LLMs this will manifest itself in there still being simple logical puzzles that humans reliably get right and LLMs do not, whereas in the case of image models this will manifest itself as weird artifacts and absurd outputs that humans could easily tell are not quite right. I suspect that this lack of "true understanding" will harm model performance. It's an interesting fact that the symbolic models of old (GOFAI) do better in their domains than modern LLMs do. Humans have the advantage of both, fluidly moving between a symbolic/rational stance where concepts are held as fixed (I see a table in front of me) and one where they are held as nebulous (the table could also be used to sit on, as a source of wood, to stand on, or to not get wet if it's raining).

Sometimes when you see a table there's in fact a table there, other times it's actually an unconventional stool, the thing is knowing when to think in each way.

Lastly there's the slow progress, so far, in multi-task ML. I am more optimistic about forecasts on purpose-specific models than I am about general models. The state of the art LLM for interfacing with web-browsers won't be the same one developers will use to write code.

A high level framework to think about this is that the case where these models are particularly useful is when they are better than we are and we can trust the output. If they are worse, why use them? If we can't trust or verify their answer, even if they are better or know more than you, why use them?

The exception to this is where the model is still not uniformly human-level in a domain, but can still assist humans with subtasks within that domain. The issues with the upper right quadrant could be solved by finetuning and experts declaring that the model is as good as them. For example, with the help of doctors, models can be finetuned to predict illnesses from symptoms; the doctors can then sign an audit of the model. If users trust the panel of doctors, they may transitively trust the model as well.

A key reason for recent hype is scaling laws: The fact that ML model performance by various metrics increases predictably with increased parameter count and number of tokens the model has been trained on (Kaplan et al., 2020, Hoffmann et al., 2022). If scaling breaks, it could send AI back to another soul-searching winter, as has happened a couple of times before.
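
To make "increases predictably" concrete, here is a minimal sketch of the additive power-law form used in this literature, where loss is an irreducible term plus one term that shrinks with parameters and one that shrinks with tokens. The constants below are illustrative placeholders chosen for the example, not fitted values taken from either paper.

```python
def predicted_loss(
    params: float,
    tokens: float,
    irreducible: float = 1.7,  # illustrative placeholder, not a fitted value
    a: float = 400.0,          # illustrative placeholder
    b: float = 400.0,          # illustrative placeholder
    alpha: float = 0.34,       # illustrative placeholder
    beta: float = 0.28,        # illustrative placeholder
) -> float:
    """Loss modelled as irreducible + a / params**alpha + b / tokens**beta."""
    return irreducible + a / params ** alpha + b / tokens ** beta


def data_floor(tokens: float, irreducible: float = 1.7,
               b: float = 400.0, beta: float = 0.28) -> float:
    """Lowest loss reachable with a fixed token budget, however large the model."""
    return irreducible + b / tokens ** beta
```

With this form, growing `params` while holding `tokens` fixed can never push the loss below `data_floor(tokens)`, which is one way to see why parameter count alone eventually stops paying off.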

The Hoffmann (Chinchilla) paper shows that some of the early enthusiasm regarding scaling by parameter count alone was premature: eventually data becomes the bottleneck and we have already strip-mined the entire internet for tokens. One way forward is getting models to generate more data: as I suggested earlier, diffusion models can be asked to produce individual objects, those objects can be merged into a single image, and the model can then be trained to predict this resulting image from a prompt with positional information ("there is a red apple to the left of the green apple").

Models can also be asked to judge their own output and finetune them on the examples the model itself considers accurate, which boosts performance across various benchmarks (Huang et al., 2022). I'm not sure if this will matter much in practice on the margin, because the models are already quite good at common sense reasoning.

The one domain of interest where more out of distribution data can be generated is coding. There do not appear to be barriers to scaling code generation models if one can always generate more code or use test suites as an additional term in the loss function. It remains to be seen what kind of code can be generated with this approach: I see how models can get better at writing single functions, but going from there to writing LLVM or CAD software, that's a stretch.
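
One simple way to use a test suite as a training signal is to sample many candidate programs and keep only those that pass. The sketch below assumes candidates arrive as Python source strings defining a function with a known name; it is a filter rather than a literal loss term, and the helper names are hypothetical.

```python
from typing import Iterable, List, Tuple

Case = Tuple[tuple, object]  # (positional arguments, expected return value)


def passes_tests(source: str, func_name: str, tests: Iterable[Case]) -> bool:
    """Run a candidate implementation against input/output test cases."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe, and a candidate that loops
        # forever would hang here; a real pipeline needs a sandbox and timeouts.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False  # syntax errors, missing functions and crashes all fail


def keep_passing(
    candidates: Iterable[str], func_name: str, tests: Iterable[Case]
) -> List[str]:
    """Return only the candidates that pass every test."""
    tests = list(tests)
    return [src for src in candidates if passes_tests(src, func_name, tests)]
```

The surviving candidates could then be added to a finetuning set, which is how the supply of code data can grow beyond what was scraped.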

What about getting models that can improve themselves? If one buys scaling maximalism this should not matter much, the ML-model generating ML model will tell you to give it more parameters. If it's a matter of architecture, we already have neural architecture search. If further innovations are required, especially to make the scaling itself occur we need models that understand GPUs, interconnects and the like as well as we do; the road to AGI passes through automating away the teams developing things like JAX, XLA, FasterTransformer or PaLM. Ultimately I think solving software engineering is not enough for AGI: one needs to solve science itself.

Back in 2016 (around when AlphaGo came out) I wrote some feats that I thought would be good benchmarks for AI. Number 2 was eventually achieved (Beating Starcraft). Number 1 (Beating a card game like Magic: The Gathering) is something that hasn't really been tried, but I now suspect it is easier than it seems and could probably be accomplished if it was tried.

The point of this essay, why I wrote it, was to come up with a handful of things that I expect will surprise me if I get wrong. To hold myself accountable, I am willing to bet up to $5k on each of these.

To me, committing to making these bets is more important than actually making money out of them. Even if no one takes the bets and some of these come to be true by 2026, that will be a strong signal for me to consider my intuitions about AI development to be very misguided. One prediction market site gives strong AI a probability of 15% by 2026; this other one gives ~30% to AIs that can do sophisticated programming by the same date. Hollywood-level movies by 2026 gets 32% here. These all seem very high to me. I chose the bets below as attempts to upper bound within reason what AI capabilities might be by 2026, so implicitly my own estimate that I will lose the bets is <5% likely?

When writing this essay I found myself going back and forth between various perspectives. There's "What can systems do today", "In which ways have forecasts been wrong in the past", "How fast has progress been", "How can we decompose progress into subcomponents; how do those change?", etc. Given what I've written above, I could also have seen myself writing a different blogpost that comes to different conclusions. For example, one could take the fast improvement in AI and the Imagen results and predict that by implementing my suggestions (generating better training data) we will solve compositionality and eventually have Hollywood-level movies by 2026. How does one aggregate all this information? How meaningful are the forecasts I make here? Is it all vibes and I'm just justifying them with arguments? It does feel like that. But had I written the opposite, that by 2026 we will have done all the things I am betting against, I would be doing the same thing: I would be justifying my alternative 'Wow GPT3!' vibe with arguments.

It's not even fully clear how exactly I should update my beliefs if I lose the bets above. I just hope I do somehow.

I took a number of blogposts and fed them to GPT3 up to a point, making sure to have at least ~700 tokens, then seeing what GPT3 would return. Below I show both the original paragraph that followed the input text as well as the GPT3 completion. I ran GPT3 three times and picked what I judged to be the best response. Each section is preceded by the last snippet of text in the prompt, for context. Maybe there are better ways of doing blogpost completions than this; one particularly skilled individual at prompting says it takes even months to learn how to do it correctly. The model is by no means fine-tuned on blogposts, with further expenditure it may be possible to improve these results.

You could just read the Cochrane review. But recall the questions you could ask an expert: Is the evidence good? Do the trials used in the review use the same kind of salt or a different kind? How much sodium is in the used salt? Do the trials use the same kind of patients? How much salt do you actually eat in a day? What is your sodium blood level? What is your blood pressure? The Cochrane review would not give you a good answer to these questions.

The Cochrane review could be refined to help you. It could provide links to the studies it used, it could provide links to studies that tested the same hypothesis that were left out of the Cochrane review. It could provide links to other reviews with the same hypothesis. It could provide links to reviews with the same hypothesis with different populations. It could provide links to studies that are in the same field with different hypotheses (for example, studies on salt's effect on blood pressure). It could provide links to studies on the effect of other dietary interventions on blood pressure.

This example shows that you could use reading to get a more nuanced answer than a single expert could. You could use it to get a more nuanced answer than a whole community of experts could. You could use it to get a more nuanced answer than a whole community of experts could, plus the other papers they have read.

But reading papers is slow. Going through the papers cited in the Cochrane review is slow. Going through the papers cited in those papers is slower. Reading papers on a new topic is even slower, because you need to gain the context to understand them. Even a systematic review with a lot of context will take days to weeks to read.

The Google approach yields 2.3g per day (FDA guidelines). But then you are of course smarter than this and you keep digging; you deploy the heuristic of inverting common knowledge (more salt is worse, period) and try to find evidence that actually, too little salt can be bad. You come across some work on that, showing that there's a U-shaped relation and that perhaps the right amount of salt to consume is more than the guidelines say. You feel smug and smart. You talk to some doctors that vaguely gesture at the U/J-shaped relation between salt and mortality. But some time after, you learn of a piping hot meta-analysis fresh off the press, analyzed and endorsed by this one Stephan J. Guyenet on Twitter, reasserting accepted knowledge: less salt is better.

This is the answer I would believe, but for this particular one, I started with an advantage because I have done the prior work of searching for experts to trust in that one space (as it happens, my other go-to expert for nutrition matters concurs with Guyenet). Whether here one ends up with the correct answer or not would depend on how good one is at interpreting primary sources and how good one is at finding good experts. But here we don't have to talk to the experts; for this one very simple question that asks about a relation between two variables, we can rely on short analyses from the experts; this both points to primary sources and explains where the contradictions may be coming from.

This is better in one regard than talking to the experts; they probably can't cite all these papers from memory, nor immediately address what the trouble is with the discordant paper you found that morning. Particularly illustrative was this podcast between Gary Taubes (who subscribes to the "carbs are bad" view) and Guyenet (who thinks carbs are fine in moderation). Guyenet came prepared with a list of arguments and relevant work he could point to where he had pre-answered potential counterarguments to his views. Imagine now trying to talk to either of them vs reading their writings, or writing a best evidence synthesis of what they are saying. It seems obvious reading will yield the superior understanding, but not necessarily a better answer, and it will definitely take longer.

My own text works out a concrete example, citing specific sources that inform my actual belief about salt consumption, whereas GPT3 goes on a generic overview of the pros and cons of reading reviews vs talking to experts, which my own blogpost already had done; and GPT3 should know because it was part of the prompt.

The NEPA process is triggered when a federal agency is planning a major action that will have significant impacts on the environment. The process begins with the preparation of an Environmental Assessment (EA), which is a document that assesses the potential impacts of the proposed action.

If the EA finds that the proposed action may have significant impacts, the agency must prepare an Environmental Impact Statement (EIS), which is a much more comprehensive document that assesses the potential impacts in detail and considers alternatives to the proposed action.

The EIS process can be quite lengthy, and often takes years to complete. After the EIS is completed, the agency will make a decision on whether or not to proceed with the proposed action.

If the agency decides to proceed, it must still follow the NEPA process by preparing a “finding of no significant impact” (FONSI) or a “record of decision” (ROD). The FONSI is a document that justifies the agency’s decision to proceed with the proposed action in light of the information contained in the EIS. The ROD is a document that outlines the agency’s decision and the reasons for it.

The NEPA process is intended to be open and transparent, and the public is encouraged to participate in the process by providing comments on the EIS. After the EIS is completed, the agency must consider all of the public comments before making a decision on the proposed action.

NEPA as it exists today has largely become a procedural requirement - NEPA doesn’t mandate a particular outcome, or that the government places a particular weight on environmental considerations [0]. It simply requires that the government consider the environmental impact of its actions, and that it inform the public of those considerations. NEPA doesn’t prevent negative environmental impacts, so long as those impacts have been properly documented and the agency has taken a “hard look” at them - as one agency official described it, “I like to say you can pave over paradise with a NEPA document.”

More specifically, NEPA requires that a “detailed statement” be produced describing any significant environmental impacts of “major” federal actions. “Major federal action” might be anything from:

Or anything else that could possibly have significant environmental impacts. In practice, little effort seems to be placed on determining whether an action qualifies as “major”, and anything that might have significant environmental effects in practice must be NEPA compliant.

On a first pass, there is nothing wrong with the GPT3 completion; Brian Potter talks about something different in the actual post, he starts with a high level conceptual argument for what NEPA is, making points that readers might be surprised to learn like "NEPA doesn’t mandate a particular outcome, or that the government places a particular weight on environmental considerations [0]. It simply requires that the government consider the environmental impact of its actions, and that it inform the public of those considerations.", then explains what major federal actions might be. Then Brian explains the EA->EIS->FONSI cascade, adding something that GPT3 did not consider: categorical exclusions. The paragraph where he does that is more directly comparable to GPT3:

The last paragraph is one that GPT3 seems to struggle with: giving concrete examples. I tried to feed GPT3 the entire text, up to and including "EISs have become long, involved analyses that take years to complete and are often thousands of pages in length. For instance, " and what I get is this (tried three times):

The original text gets us page counts and the time they take to complete, which is very relevant information, and a key fact about the absurdity of the system, at least in the 80s. GPT3 gives us just page lengths that are not even consistent across runs.

The “tragedy of the commons” model: imagine a pasture open to all, and that grazing it excessively will destroy it. Each herdsman seeks to keep as many cattle as possible, and the only consequence he foresees is his benefit.

In the economics literature this is called the “tragedy of the commons”, although the original example was a little different and didn’t involve tragedy. In political science it’s called an example of 'collective action', although I don’t know if this is standard terminology.

Suppose there are a thousand herdsmen, each of whom has a herd of a hundred cows. If they all go out and graze their cows on the pasture every day, the pasture will be destroyed in a matter of days. But each herdsman only cares about his benefit. He doesn’t want to sacrifice his hundred cows to help preserve the pasture, since the pasture is open to everyone and he can’t reap the full benefit. He knows that if he doesn’t graze his cows, the pasture won’t be destroyed, but the other herdsmen will keep grazing, and he will get nothing. Each herdsman reasons like this, and the pasture is destroyed.

Capitalism. Imagine a capitalist in a cutthroat industry. He employs workers in a sweatshop to sew garments, which he sells at minimal profit. Maybe he would like to pay his workers more, or give them nicer working conditions. But he can’t, because that would raise the price of his products and he would be outcompeted by his cheaper rivals and go bankrupt. Maybe many of his rivals are nice people who would like to pay their workers more, but unless they have some kind of ironclad guarantee that none of them are going to defect by undercutting their prices they can’t do it.

Like the rats, who gradually lose all values except sheer competition, so companies in an economic environment of sufficiently intense competition are forced to abandon all values except optimizing-for-profit or else be outcompeted by companies that optimized for profit better and so can sell the same service at a lower price.

(I’m not really sure how widely people appreciate the value of analogizing capitalism to evolution. Fit companies – defined as those that make the customer want to buy from them – survive, expand, and inspire future efforts, and unfit companies – defined as those no one wants to buy from – go bankrupt and die out along with their company DNA. The reasons Nature is red and tooth and claw are the same reasons the market is ruthless and exploitative)

From a god’s-eye-view, we can contrive a friendly industry where every company pays its workers a living wage. From within the system, there’s no way to enact it.

(Moloch whose love is endless oil and stone! Moloch whose blood is running money!)

The example from GPT3 is not wrong and is thematically accurate: In the original post Scott is listing examples of tragedies of the commons, strangely without ever using the phrase "tragedy of the commons"; in fact the GPT3 example (pasture and overgrazing) was the same example that the very first tragedy of the commons writing used to explain the concept. How does this example compare to the other few that Scott has? I'd submit that it is less interes
