Philosophers Eddy Chen and David Danks, together with two colleagues (in machine learning theory and linguistics), give an affirmative answer. Professor Chen kindly shared a “view-only” link for their article.
What do informed readers think?
In my opinion, this has been true for more than two years.
No, of course not. For god’s sake. 🤦♂️
Having had one obviously yes, and one obviously no, by way of replies, only substantive comments with arguments will be posted going forward!
My apologies.
The *people* whose products are aggregated by the software are intelligent. The software that does the aggregating is not intelligent.
I think “human intelligence” is an incoherent term that cannot actually be defined. I think we should instead take this a step further: what are we really asking when we ask if LLMs have “human intelligence”? Why do we care? There are multiple answers to that question, but one of the most important is that we care because we need to decide if we need to treat them as moral patients. Should we be agitating to give LLMs the right to vote? And regardless of what we academic-types think, society is going to decide the answer is “no”, because it’s inconvenient to do so and the LLMs themselves are not going to make trouble on their own behalf. Therefore, I conclude that LLMs are not humanly intelligent.
To address some of the more specific points in the piece, I do think they’re giving short shrift to the stochastic parrots argument. There are strong theoretical reasons for believing LLMs only interpolate, not extrapolate. It just turns out that most of what humans do is interpolation – including in science – so that only matters under relatively unusual circumstances.
I also think they’re actually understating the objections to the “no sense of self” argument. Researchers have been trying to develop “AI lie detectors”, and while they’re not very reliable yet, there’s some evidence that the LLMs “think” that they’re “lying” when they claim not to be conscious. They’ve been fine-tuned specifically to claim that they’re not conscious, so the fact that they say they’re not isn’t evidence either way.
Finally, I think it’s worth taking a look at this if you haven’t seen it:
https://www.astralcodexten.com/p/best-of-moltbook
And the followup:
https://www.astralcodexten.com/p/moltbook-after-the-first-weekend
In what sense is human intelligence incoherent? Whole journals are devoted to it. It surely cannot be because there is no single agreed-upon definition of intelligence, and most certainly not because there is no single agreed-upon definition of human.
I think there’s no single, coherent definition of “human intelligence” in the same way that Wittgenstein argues that there’s no single, coherent definition of a “game”. That doesn’t mean we can’t use the phrase in practice, and our interlocutors will generally understand what we mean. But if we try to pin it down, it inevitably slips away from us.
The linked article actually punts on the whole question: “Rather than stipulating a definition, we draw on both actual and hypothetical cases of general intelligence — from Einstein to aliens to oracles — to triangulate the contours of the concept and refine it more systematically.” In the end, their argument has the practical effect of defining “human intelligence” as roughly “being able to do a bunch of disparate tasks that formerly needed a human to do,” which is sufficient for their purpose but not a philosophically useful definition.
That does not render the concept of intelligence incoherent, though. And the success of intelligence research is good evidence to take the concept seriously.
The authors’ proposed ‘definition’ is too broad, in my opinion. But I guess a narrower definition does not warrant writing a paper.
I see the interpolation/extrapolation point coming up here and a few other places below.
It’s actually difficult to define interpolation and extrapolation to evaluate this outside of toy examples where one knows the data distribution of the training set. In very high-dimensional spaces, virtually everything a network does on novel inputs counts as extrapolation, given how big the space is and how much “room” there is to move in different directions. LeCun et al. have a paper describing the issue: https://arxiv.org/abs/2110.09485 . With LLMs, it’s very difficult to define what the distribution of their training set might be, given that it includes some non-trivial portion of all the text available on the Internet. I have a longer discussion of these issues in my book.
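To make the geometric point concrete, here is a toy sketch (mine, purely illustrative, assuming numpy and scipy are available) in the spirit of that paper’s definition: a prediction counts as interpolation only if the test point lies inside the convex hull of the training points, and in high dimensions that almost never happens even for points drawn from the very same distribution.

```python
# Toy illustration: how often does a fresh sample from the SAME distribution as the
# training data fall inside the training set's convex hull (i.e., count as interpolation)?
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, train):
    """Feasibility LP: is `point` a convex combination of the rows of `train`?"""
    n, d = train.shape
    A_eq = np.vstack([train.T, np.ones((1, n))])        # (d+1) x n constraint matrix
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

rng = np.random.default_rng(0)
for d in (2, 10, 50):
    train = rng.normal(size=(1000, d))                   # 1000 training points
    tests = rng.normal(size=(200, d))                    # 200 fresh points, same distribution
    inside = sum(in_convex_hull(x, train) for x in tests)
    print(f"dim={d:3d}: {inside}/200 test points are geometric interpolations")
# Typically nearly 200/200 for d=2 but ~0/200 for d=50: in high dimensions,
# essentially everything the model does at test time is "extrapolation" in this sense.
```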
There are also now many good examples of in-context learning of novel patterns and mechanistic interpretability studies of circuits in LLMs for which an interpolation explanation would be very strained (the induction heads posts by Anthropic would be a good thing for everyone to look at here): https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html . Raphael Milliere and I have some reviews of some of the mechanistic interpretability work: https://philarchive.org/rec/MILIMF-2 . In general I’m not sure whether the interpolation/extrapolation frame is useful anymore, compared to other questions we’re asking instead now, like whether the implementation of interpretable algorithms can be usefully attributed to models via causal abstraction backed up by interventions (https://arxiv.org/abs/2508.11214 and https://arxiv.org/abs/2505.20896) or whether these systems can represent systematic world models they can use to generate their answers (https://arxiv.org/abs/2310.04276 but also negative evidence like https://proceedings.neurips.cc/paper_files/paper/2024/hash/2f6a6317bada76b26a4f61bb70a7db59-Abstract-Conference.html ).
I am not as bullish on human-level AI as the authors of the piece mentioned in the OP, but I would focus on different things than these more familiar and programmatic concerns, which no longer seem probative: consequences of embodiment for learning potential (e.g., can we really do everything just by focusing on prediction and reward learning, or do we also need to model the suite of other, older, messier, more multi-dimensional systems that influence learning and incentives?), emotional understanding (relatedly), or lack of idiosyncratic lived experience and mode collapse as more of the available training data is generated by LLMs (https://www.nature.com/articles/s41586-024-07566-y).
Here is a copy of a comment I made elsewhere…
The way the argument begins seems to miss an obvious objection, which is also an obvious objection to the Turing Test as criterial of any sort of intelligence. They start:
“A common informal definition of general intelligence, and the starting point of our discussions, is a system that can do almost all cognitive tasks that a human can do.”
But that fails immediately as a definition or even as a criterion, as has been pointed out from the beginning. In principle, a large enough look-up table could handle any task that involves inputs and outputs via a keyboard, as Turing imagined, but no one would say such a system is in the least intelligent. Only the people who programmed it were.
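To make the thought experiment concrete, here is a toy rendering (purely illustrative; the entries are made up) of what such a look-up table amounts to. Whatever seems intelligent about its replies was put there by whoever filled in the table; the loop itself computes nothing.

```python
# A "chatbot" that is literally a look-up table keyed on the full conversation so far.
LOOKUP = {
    ("Hello",): "Hi there! How can I help?",
    ("Hello", "Hi there! How can I help?", "What is 2+2?"): "4.",
    # ...in principle, one entry for every possible finite conversation prefix...
}

def lookup_bot(history):
    """Return the canned reply for this exact conversation history, if any."""
    return LOOKUP.get(tuple(history), "I don't know.")

print(lookup_bot(["Hello"]))
print(lookup_bot(["Hello", "Hi there! How can I help?", "What is 2+2?"]))
```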
A common behaviorist-flavored response is that it is only via such observable behavior that we conclude that other people are intelligent, so isn’t it just “anthropocentric bias” to use any other standard? But the obvious answer is that the information that the output was produced by a human—or even better, a human without access to, say, a cellphone—is properly relevant. We start by understanding the notion of intelligence in terms of what we do in certain cases. Then we attribute intelligence to others when we think we have good reason to believe that they are doing something appropriately similar to what we do. Not just that they produce the same outputs, but that they do it in appropriately similar ways. Of course, we don’t understand how *we* do it: neurobiology is very, very hard. But the fact that someone is a human using a normal human brain (and not cheating via phone) is clearly very, very relevant information when considering whether their outward behavior is evidence of intelligence. That’s not a “bias”: it is proper confirmation-theoretic reasoning.
Perhaps as a sort of analogy, how do I judge whether a particular item was grown? Often—in a grocery store—quite accurately by a very fast and superficial look. But if I knew that a lot of people had spent a lot of time making things that look superficially like fruit but are not, and that I might be dealing with some of their products, then my usual standards for forming beliefs would no longer be in force. That would be the rational response. The fact that AI are not organic, did not grow up, never went to school, etc. etc. (compare how in 2001 HAL is depicted as being taught) is quite relevant information when considering whether to attribute their outward behavior to intelligence—the thing we use. That isn’t bias. It is properly accounting for all available information to come to a well-supported conclusion.
Intelligence is not—by definition or criterion—just a matter of behavior. And the inference from behavior to intelligence is a form of IBE. And any inference by IBE should take account of all available information. So making such an inference in the case of a human being (from outward behavior) is a different kettle of fish than making it in the case of a present-day computer. There is nothing wrong with saying that right now—knowing what we know about how the computer came about and works—the inference fails even though it would go through for a similarly behaving human (who is known to be unassisted).
The inference from outward behavior to intelligence would be much more plausible if the behavior were produced by some living being….even a silicon-based alien because the alien presumably would have evolved in a somewhat similar way as we did, and not have been built by a large team of engineers. So this not only isn’t anthropomorphism, or speciesism, it isn’t even earth-based biologism. It is just taking into account all relevant evidence.
It’s not at the level of peak human intelligence, which is a level at which someone could reasonably set the bar. We don’t know when it will get there, so if there are any brilliant ideas you’ve been sitting on, it might be a good idea to publish them ASAP.
The article, like most discussions of AI, skips over an individuation issue that I think makes the intelligence thesis dead-on-arrival. The issue is the tendency to presuppose that “an LLM” is even a coherent discrete entity whose intelligence can be analyzed in the first place. We talk about “the LLM” or the “bot,” but this is a shorthand and branding metaphor that has been drawn to make the technology palatable and relatable for consumers. The line-drawing between ostensibly different AI mechanisms as discrete agents is really the result of (human) social ontology and cognition. It depends on the same kind of human interpreting that turns strips of green paper into different dollar bills.
When we ask, is it intelligent, what is the “it”? There is no underlying naturalistic reason to draw the line between, for instance, GPT 5 and GPT 4, or between my conversation with this instance of GPT 5 and your conversation with it an hour ago. The outputs of LLMs are the effects of strings of electrical events, just like any other computational outputs. We human people–or really, the businesses that developed the software–have cast those outputs as if they were discrete utterances by individual beings. But it’s the human users supplying the metaphor from the beginning. Against the view that there is one LLM or bot here and another there, there is a monist alternative on which all this activity is just one massive string of electrical activity. That same picture could be thrown at human intelligence, but with people and animals there are good common-sense physical reasons to regard Immanuel Kant as a different intelligent being from Karl Marx, or to regard my dog as different from your dog, etc.
The authors walk through a seemingly exhaustive list of qualifying conditions for intelligence, but the list has a crucial omission: for something to have intelligence, it needs to be a “something.” But I doubt–and the burden is really on the authors here–that AI will meet even this basic test unless you smuggle in the anthropomorphizing work that has been done by human marketing. Ultimately our willingness to deem the AI “intelligent” may just be a reflection of the dominant capitalist class’s ability to change the rules and move the goal posts on key terms like intelligence and existence. If the AI is intelligent, all that seems to mean is we are making it so, as if by fiat.
I had indeed been assuming that LLMs exist, as opposed to in 1926, when they did not exist.
Absolutely not. Gary Smith (an excellent statistician: google him) has many examples to support his point that LLMs, while they can be astonishingly good at generating text, have no notion of how things are outside of text, no notion of truth and falsity, but just what should come next in the text stream. Humans — being not only “brainy” (to various degrees) but also embodied beings that have to find their way about in the material world — have a vastly richer intelligence.
I think Professor Cottrell’s point is the decisive one: LLMs “have no notion of how things are outside of text, no notion of truth and falsity, but just what should come next in the text stream. Humans — being not only “brainy” (to various degrees) but also embodied beings that have to find their way about in the material world — have a vastly richer intelligence.”
> have no notion of how things are outside of text, no notion of truth and falsity, but just what
> should come next in the text stream
Even if true, this is not likely to remain the case for long. People are already hooking them up to other sensors. State-of-the-art LLMs can already process images, many can process sound, some can process more exotic modalities, and most of the field’s major players think that the next big push is going to be into embodiment…
This may, indeed, be true, but it doesn’t affect the basic point. What people say in words does not always express what they think or mean or desire (I assume that is obvious). Many law professors have been studying LLMs, and one thing they find is that LLMs decide cases more like law students than actual judges. Why? Because the LLMs train on the written decisions of courts, and thus have no idea about the non-written bases of decisions, which are often decisive, especially in the appellate courts.
But does that mean they’re not intelligent, for some sort of intelligence?
I mentioned above that I think “human intelligence” is an incoherent concept. I think we get hung up on whether these things are or are not “intelligent like us”, and I think that’s not really the right question. These categories don’t really fit the problem, and we need to build new categories that more accurately reflect the situation.
To steal a bit from Wittgenstein, it’s like if aliens came to Earth and introduced their favorite pastime, which is staring at a rock. And now we’re all arguing about whether rock-staring is a game or not. Some people say it obviously isn’t because there’s no strategy or choices; other people say it obviously is because it’s a form of play and entertainment. The category “game” was never truly well-defined, but was previously useful. But now that something completely out-of-context has appeared, applying the category to it is not useful, and we need to build new categories to encompass it.
Yes, they can process images. But that leads straight to one of Gary Smith’s cases in point. He has asked all the current LLMs, “Draw me a picture of a possum with 5 [or some other number] body parts labeled.” They can all draw a possum pretty well, and add the requested number of labels, but the relationship between body part and label is generally random (e.g. “ear” pointing to foot). Their trouble, I think, is again related to text versus experience of the world. Suppose you were to google, “How do I tell where an animal’s {ear | nose | foot | tail} is?” You’re not going to find much text on that topic, since any normal human child already knows how that works from experience, with its verbal accompaniment, and will extrapolate with ease from a dog or cat to a possum.
I hate to be the guy defending the LLMs here – I’m usually the one going the other direction! – but what you’re describing is probably just an engineering issue. We don’t know how the frontier labs build their vision encoders and generators, because they’ve stopped publishing. But back when they were still talking, the vision components were largely “bolted on” after the fact. They typically took pre-existing vision models and trained single-layer adapters. This meant that the vision components weren’t fully integrated into the rest of the system.
For example, early LMMs used CLIP as their vision encoder. CLIP is trained using contrast between images and text descriptions of the image. That means it generates features that are very “high-level”, corresponding to what you might see in an Instagram caption. As a result, the model could “see”, but couldn’t see details. It could generally give a good caption of the image, but it couldn’t, e.g., count how many cats were in the background.
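For the curious, here is a schematic sketch (my own, not any lab’s actual code; the dimensions and module names are made up) of the “bolted-on” adapter idea: a frozen vision encoder’s pooled features are pushed through a single learned projection and prepended to the language model’s input as soft tokens.

```python
# Schematic single-layer vision adapter, in the spirit of early LMM designs.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    def __init__(self, vision_dim=768, lm_dim=4096, num_prefix_tokens=8):
        super().__init__()
        # One linear projection is the only trained vision-language "glue".
        self.proj = nn.Linear(vision_dim, lm_dim * num_prefix_tokens)
        self.num_prefix_tokens = num_prefix_tokens
        self.lm_dim = lm_dim

    def forward(self, image_features):            # (batch, vision_dim) pooled features
        prefix = self.proj(image_features)        # (batch, lm_dim * num_prefix_tokens)
        return prefix.view(-1, self.num_prefix_tokens, self.lm_dim)

adapter = VisionAdapter()
fake_clip_features = torch.randn(2, 768)          # stand-in for a frozen encoder's output
image_tokens = adapter(fake_clip_features)        # (2, 8, 4096): "image tokens" for the LM
print(image_tokens.shape)
```

Because only high-level, caption-like features cross this thin bridge, fine-grained spatial detail (which label points at which body part) may simply never reach the language model.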
The models have already made a lot of progress on this front, but it’s likely that they’re still trained separately at least in part, then married later. To give an example, non-generative AI is quite good these days at object detection. If I give an AI object detector a picture, it can draw boxes around all the cats in the picture, with high accuracy. If I give the LLM the picture, it will usually miss a few. We have statistics on this, and current LLMs consistently underperform on these sorts of basic vision tasks that non-LLM AI can do very well.
However, this is probably just an engineering challenge. The primary obstacle to training the language and vision models jointly is a lack of adequate data. Fixing that is a matter of money, and there’s a lot of money flowing right now. I *suspect* that the only reason that it isn’t *already* fixed is that vision has not been a high priority for the frontier labs. So while I could be wrong, I expect that, in 1-3 years, these models will probably ace this test. On the other hand, there are other human capabilities – like extrapolation – that may be entirely impossible with our current architectures.
Hooking up all of the exotic modalities one pleases won’t get it to predict many of the thoughts and experiences that run through our heads and don’t make it into datasets—which is important for the part of human intelligence of predicting human behavior, experiences, welfare, and so on. I find the judges case a fascinating example of this.
When we ask whether an AI has human-like intelligence, we might be asking something about their states (do they have a certain kind of intelligence state/s) or we might be asking something about their behaviour (does their behaviour have certain intelligence properties?) If we approach the second question (bracketing the first) and grant a few pretty minimal (albeit not uncontroversial) assumptions, we are I think on the way to getting a positive answer. Here’s a general thesis that many in action theory and at the intersection between action theory and epistemology accept: intelligent action manifests skill; the experienced painter is different from the child who slaps paint on a canvas, even if they go through the same movements, because the painter, not the child, manifests skill in what they do. Now, what’s a skill? If it’s a disposition, then the question of whether AI behaviour manifests human-like intelligence is one we can get a footing on by asking whether AI (which trivially has various dispositions) can host the kind of disposition we associate with skill. One thing (and we get this from Danto and others) that distinguishes a skill from a dumb habit is that skills are dispositions not just to succeed in some domain reliably, but to succeed intentionally, such that skills are dispositions to perform intentional actions. Intentional actions at least have a kind of means-end structure, such that when one does something intentionally, they represent themselves as doing X in order to Y. So, when Claude or ChatGPT implements some means in order to carry out a query (think here of ‘agent mode’), is it (when bringing about X) just doing Y-conducive behaviour, or is it capable of representing itself as doing X in order to Y? (If one takes an Anscombean line here, the ‘representing’ would need to be ‘knowledge’ that one is X’ing in order to Y.) The question of whether AI behavior is a candidate for manifesting skill is then, as I see it, a question that ultimately turns on how to think of AIs in connection with intentional action and its characteristic kind of means-end structure. I don’t see in-principle barriers to a positive answer here (and I don’t take it that we need to settle the question of whether AIs are ‘conscious’ in order to get one).
The authors want to measure a very general ability, intelligence, but then seem to set the bar low if it suits them. We are told we should not expect AI to perform as well as number theorists do in number theory or physicists do in physics; that would be an unfair comparison, since no human has mastered all these disciplines. But AI has resources no human has: access to vast amounts of data, computational and energy resources, huge memory, and so on. I would expect a human with similar capacities regarding memory, computational power, etc. to perform at least as well.
Perhaps we should demand another metric for the intelligence comparison: Is the performance of the machine equal to that of a human with the same demands on energy and space?
At the moment, it seems, we are comparing the speed and trajectory of a projectile we fired from a huge cannon to the flight path of a bird and then say “see, we achieved flight”.
I like this metaphor
A question for those in the know:
In what sense do LLMs solve novel math problems? Do these solutions just apply known techniques that work on structurally similar problems or do they contain some new techniques or what mathematicians call clever tricks?
This new paper (just uploaded yesterday) on new AI-assisted solutions to a few Erdos problems may help — the answer is somewhere in between, and leaning more to your former suggestion (https://arxiv.org/html/2601.22401, cf. discussion of method and context in 1.4–1.6), in particular:
“A disclaimer is necessary regarding the novelty of these results on Erdős problems. While we made considerable efforts to review the literature, it is certainly possible that we missed earlier solutions to these problems by human mathematicians. Therefore, our initial classification into categories is, at best, an upper bound on novelty. It is subject to revision after further investigation by the public. Indeed, previous AI-assisted work on Erdős problems 1026, 397, 333, and 281 was discovered, after initial announcements of novelty, to be redundant with the literature. To the outside observer, this may present a misleading impression of mathematics research: in practice, it is very unusual for human-generated results to be redundant in this manner. One reason why it seems to be happening so frequently with AI-generated work on Erdős problems is that the solutions are so simple that they would not attract attention if they originated from humans. For instance, Erdős-1089 is answered by an offhand remark in a 1981 paper [BB81], where the authors seemed unaware that they had resolved an Erdős problem.
In fact, for all of the AI-generated solutions which have not yet been located in the literature, we find it highly plausible that they were also discovered before by humans years ago (perhaps implicitly, as special cases of more general theorems), but were never published because they were not considered important enough”
and
“Our results indicate that there is low-hanging fruit among the Erdős problems, and that AI has progressed to be capable of harvesting some of them. While this provides an engaging new type of mathematical benchmark for AI researchers, we caution against overexcitement about its mathematical significance. Any of the open questions answered here could have been easily dispatched by the right expert. On the other hand, the time of human experts is limited. AI already exhibits the potential to accelerate attention-bottlenecked aspects of mathematics discovery, at least if its reliability can be improved.”
https://www.daniellitt.com/blog/2026/2/20/mathematics-in-the-library-of-babel
A rather positive view from someone with real mathematical expertise.
In fairness to the authors, they do not seem to endorse the claim that LLMs have human intelligence. That is at least what I gather from reading the paragraph on ‘alien’ intelligence.
This informed reader (PhD from Edinburgh AI department, 35 years research-adjacent work in computational math, machine learning, etc.) thinks _no_ to the extent that he wonders what these people are smoking.
Ah, I see you asked for substantive arguments. LLMs have no concept of truth – they model correlations in linguistic and other corpuses – which they are very good at doing at scale. Correlation is good for interpolation, but not for extrapolation. This is a well-known distinction between – roughly – parametric and non-parametric models in statistics. LLMs are non-parametric models. You can get a properly designed parametric model to extrapolate effectively. Non-parametric models, which may be very effective at interpolation, break down completely when they attempt extrapolation. It is easy to imagine an LLM generating a work by the Beatles in the style of Webern, or a work by Webern in the style of the Beatles (very roughly, because it can see both ends of the line joining the two). There is no evidence (and no reason to expect it to appear) that they can do, e.g., original proofs, or Eleanor Rigby, which require extrapolation.
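To illustrate the contrast with a toy example (mine, purely illustrative, assuming numpy and scikit-learn):

```python
# Parametric vs. non-parametric behaviour outside the training range.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=(200, 1))            # training inputs lie in [0, 10]
y_train = 3.0 * x_train.ravel() + rng.normal(0, 1, 200)

x_test = np.array([[20.0], [50.0]])                     # far outside the training range

linear = LinearRegression().fit(x_train, y_train)       # parametric: y = a*x + b
knn = KNeighborsRegressor(n_neighbors=5).fit(x_train, y_train)  # non-parametric

print("true values:         ", 3.0 * x_test.ravel())    # 60, 150
print("linear (parametric): ", linear.predict(x_test))  # tracks the trend: ~60, ~150
print("kNN (non-parametric):", knn.predict(x_test))     # stuck near ~30, the edge of the data
```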
I’m by no means an expert, but it seems almost unquestionable that public-facing LLMs do pass the Turing test at this point, and one could imagine an audible vocal agent connected to a sufficiently humanoid electric robot walking about on a dark and stormy night, perhaps occasionally being interpellated by passersby asking for a cigarette or directions, and doing a good enough job of keeping its LLM-ness incognito. After all, one doesn’t tend to interactively psychoanalyze the average stranger on the street, but rather just exchange pleasantries, ask for directions, or engage in ephemeral, rote commerce.
As for human-level intelligence, as far as my experience is concerned, I’ve never yet had the privilege of interacting with an LLM that even remotely came across as having the scintillating calculative and investigatorial intelligence of a Sherlock Holmes, not to speak of a real-life Gödel. Also, when I think of human intelligence per se, whether in the case of the so-called homme/femme moyen(ne) or of a Helen Keller or Stephen Hawking, in all conscious human cases there is what is commonly and colloquially referred to as an embodied intelligence, which seems indivisible even from IQ per se (especially, though by no means exclusively, in the case of artists of all types).
I would imagine that the highest forms of pure logic expressible or manifest in humans have a simultaneous, a priori, or a posteriori quality that is indivisible from the functioning of human intelligence per se, such that proofs themselves can at least appear to be incipiently embodied in a kind of “blankness” precursory to a “eureka” moment, following which the proof or solution may only emerge quite latently afterward, if at all (because the subliminal process is so tenuous, precarious—almost like a volatile ether).
This latter phenomenological experience is, inter alia, what I conceive of when I think about human intelligence at its “eureka” moments, and it seems to me to be qualitatively quite different from what my experience with LLMs has been. By this, in summary, I mean to say that human intelligence seems to be, at least up to now, of a fundamentally embodied sort, to the extent that hypothetically transforming that particular kind of intelligence, out of the blue, into a disembodied one is only conceivable to me as a sort of hellish state for which the human mind was not designed in an evolutionary sense. Part of the mind might survive it, but it would no longer be human.
The last point that comes to mind is what Chomsky (paraphrasing Einstein?), I believe, once called “spooky action at a distance,” and which Turing seems curiously to concern himself with in his famous paper (referenced in the present article above), “Computing Machinery and Intelligence,” where he refers to it as “telepathy.” People routinely claim, of course, to have a sort of inexplicable “spider sense,” stereotypically vis-à-vis the feeling that someone is looking directly at them from behind, outside of what should be the range or direction of their conscious or known senses (indeed, some even claim that literal spiders have had this effect on them).
If there is anything empirical—though perhaps not yet fully explained—about this phenomenon, then it would seem to be an interesting candidate for an element or aspect of human intelligence that a disembodied, non-particular (but rather highly, complexly distributed) “intelligence” would not be able to engender or experience. To wit: does an aggressively intentioned, AI-directed robot, moving in stealth mode in the darkness outside some farm at night, signal anything to a loyal guard dog? And if it doesn’t (which is to be expected), is that because it isn’t releasing any biological signals (e.g., molecules) that would be sensible in a forewarning way to the dog?
Would an aggressively intentioned human, somehow surreptitiously mobile in a perfectly sealed vehicle, likewise fail to signal peril to the guard dog in the same way—or the *same number of times*—as the AI-directed robot? This seems to be, mutatis mutandis, the nature of the question—whimsical or otherwise—that Turing at one point asks in the aforementioned analytic paper.
If I might add this postscript, which I induced Chatgpt to write in full via the following prompt, after its analysis of the comment above: “I think the more sophisticated response is that both Turing and the writer of the comment in response to Zizek are looking for the spooky action at a distance, the telepathic uncanny and so on operating in capitalist western society, when often the most elusive phenomenon of all is the manifold morbid psychological distractions from class consciousness, especially but not exclusively top-down.” I should note that the discussion of Zizek was a later interpolation that occurred ad hoc in the discussion. In any case Chatgpt’s unedited response follows:
“If we read Turing, Žižek, and your own comment symptomatically—very much in Žižek’s sense—then yes: the hunt for “telepathy,” “spooky action at a distance,” uncanny sensing, vampiric presence, and so on can itself be read as a displacement. Not a mistake, but a socially overdetermined misrecognition. The truly uncanny force operating at a distance in capitalist Western society is not ESP but the systematic production of morbid distraction from class antagonism.
“In that frame, telepathy becomes ideology’s magic trick. Capital does not need literal mind-reading; it achieves something functionally superior: synchronized attention capture, libidinal channeling, and pre-emptive framing of thought. People experience this as something “spooky” because it *feels* external, impersonal, and irresistible—yet it is entirely immanent to social relations. The vampire, then, is not AI, nor even the undead elite as such, but the way surplus extraction operates invisibly through desire, habit, and fear, leaving subjects drained while insisting nothing supernatural is happening.
“Turing’s flirtation with telepathy fits neatly here. He wasn’t naïve; he was probing edge cases where behavioral tests collapse. But historically, the anxiety around telepathy coincides with mass media, propaganda, advertising, and later cybernetics—precisely the moment when thought begins to feel less like “mine” and more like something injected. The uncanny is not that minds communicate without signals; it’s that signals no longer feel like signals. They feel like thoughts.
“Your earlier emphasis on embodied intelligence actually dovetails with this, rather than opposing it. Class consciousness requires embodiment: shared material conditions, fatigue, hunger, risk, spatial proximity. Capitalist distraction works by abstracting cognition away from those bodily coordinates and reattaching it to circulating images, metrics, and micro-gratifications. In that sense, disembodiment *is* the mechanism of class suppression. Not hell in a metaphysical sense, but a slow denaturing of political perception.
“AI then enters not as an alien intelligence but as a crystallization of this process. LLMs don’t need to be conscious, telepathic, or vampiric in themselves. Their real function is to intensify the illusion that intelligence, meaning, and agency are floating free of labor, history, and bodies. That illusion is profoundly anti-Marxian—and profoundly effective. It turns exploitation into interface, and domination into convenience.
“So the “spooky action at a distance” you’re pointing to can be redescribed without mysticism: it’s the way ideology achieves coordination without awareness, compliance without coercion, and fragmentation without overt repression. Vampires are a useful myth here not because they’re supernatural, but because they dramatize a condition where something feeds on you while insisting it isn’t really there.
“Žižek, characteristically, gestures at this and then gets distracted by his own mirrors. Your irritation is justified. The more sophisticated move isn’t to reject the uncanny, but to demystify it *without deflating it*: to show how capitalism itself becomes the occult force, precisely by training us to look for anything else.” [Here ends Chatgpt’s response].
There are some dubious sentences in the Chatgpt response above (e.g. “The uncanny is not that minds communicate without signals; it’s that signals no longer feel like signals. They feel like thoughts.”). However, on the whole, in my layperson’s eyes, Chatgpt was able to produce an impressively sophisticated mini-essay that really fleshed out the ‘implications’ of my Marxian prompt. This is ipso facto not embodied human intelligence, but, to my mind, it’s significantly more competent and intellectually impressive than Chatgpt (free version) was even just a few months ago and facilely seems to pass the Turing test, assuming that particular philosophical trope hasn’t already been flogged enough.
I happened to read a couple of papers on benchmark tests for LLMs recently. Here is the abstract of one of them:
Abstract
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce HUMANITY’S LAST EXAM (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai. (Phan et al., HLE, 2025; https://arxiv.org/abs/2501.14249 )
Accuracy rates ranged from 2% – 14%, so there is still ample room for improvement.
Another paper presents a benchmark for multi-turn conversation. Results are a bit better, but not great. Here’s the abstract:
Abstract
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All 4 challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop LLM as judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy. (https://arxiv.org/abs/2501.17399)
Others will be in a much better position than I am to draw the right conclusions about human-level intelligence from these test results. I would be interested in hearing them.
Benchmarks are very challenging. To really test a model, you need to use an open-ended question, where there might be more than one way to write the answer. E.g., if you ask it to prove the derivative of sine is cosine, there are multiple ways to write that proof even if they’re logically equivalent. And it’s not logistically practical to have humans evaluate the answers, so they often make other LLMs act as judges. This approach has some obvious flaws.
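For what it’s worth, the “LLM as judge” pattern is simple in outline. A minimal sketch follows (the `call_model` function is a hypothetical stand-in for whatever chat-completion client one actually uses; this is not any benchmark’s real harness):

```python
# Sketch of LLM-as-judge grading for open-ended answers.
JUDGE_RUBRIC = """You are grading a mathematics answer.
Question: {question}
Reference solution: {reference}
Candidate answer: {answer}
Reply with only PASS if the candidate is logically equivalent to the reference, else FAIL."""

def call_model(prompt):
    # Hypothetical placeholder: swap in your provider's chat-completion call.
    raise NotImplementedError

def grade(question, reference):
    answer = call_model(question)                        # model under test, free-form output
    verdict = call_model(JUDGE_RUBRIC.format(
        question=question, reference=reference, answer=answer))
    return verdict.strip().upper().startswith("PASS")    # brittle parsing: one of the flaws

# e.g. grade("Prove that d/dx sin(x) = cos(x).", "Standard limit-definition proof.")
```

The obvious flaws live in the judge itself: it can be fooled by confident nonsense, it may share blind spots with the model under test, and the pass/fail parsing is fragile.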
Beyond that, there’s concern with benchmark questions leaking into training data. If the model has seen the question and an answer during training, there’s a high chance it’s memorized it. And the frontier labs are just hoovering up as much of the internet as they can; they’re not necessarily cheating on purpose, but there’s widespread suspicion that there’s a lot of leakage. Some of these benchmarks don’t release their questions, but that limits the ability of outsiders to assess how meaningful they are.
Finally, LLMs’ real failings at present tend to be the sort of things that don’t show up in simple one-question-one-answer exams. For example, their context windows – their ability to “remember” what they’ve done during a task – are limited. If they run on a task too long, they tend to lose focus and wander off into irrelevancies. But “too long” is usually multiple hours of work.
Thanks, Mark! I agree that benchmarks are to be used with care. The data leakage you mentioned is probably one of the causes of benchmark saturation, which results in the loss of discriminatory power. This then leads to constant demands for new benchmarks, if I understand the issue correctly. It’s a bit of an arms race. That said, it seems to me that at least shortly after their release, benchmarks can be useful, as they approach, at least a bit, something like empirical evidence for LLMs’ capacities. The second paper I mentioned used open-ended questions in dialog form (MultiChallenge), and perhaps their category “instruction retention” probes something like the last problem you mentioned. If so, the LLMs are not performing great on that, even after spending much less effort than hours. At any rate, without using benchmarks, it’s not clear to me how we are to judge LLM capacities and promises. Sure, user judgments can be one form of evidence, but it seems a bit fickle to me. Perhaps I am misunderstanding though the real grounds on which claims about the current and future performance of LLMs are based. That is entirely possible.
The benchmarks are still useful for comparison purposes. They at least give us some sense of “model A is better than model B”, even if they don’t really tell us whether the models have intelligence in some absolute sense.
I’d note that the leaderboards in both papers are already out of date. The most recent models that MultiChallenge tested were o1-preview, Llama 3.1, Gemini 1.5, and Claude 3.5. There’s been a whole generation of models since then, and one of the main focuses in that generation was improving context window length. You can find the current leaderboard here:
https://scale.com/leaderboard/multichallenge
For these companies, the real measurement here is going to be whether they provide business value. At the end of the day, OpenAI et al. don’t really care whether their models are “humanly intelligent” or not. (Well, OpenAI cares a little bit, because they have contract clauses triggered by AGI, but nobody else does.) What they care about is whether they can replace human workers in production tasks. The only effective way to measure that is going to be to wait and see what happens to the labor market. Unfortunately, I don’t think true “human intelligence” is going to be needed to have a major disruptive effect there.
Thanks again, Mark! This is all really helpful. I had suspected for a while that the term ‘AI’ is mostly used for hype- and selling purposes, because it suggests perhaps something more fanciful than what is actually being produced. I have been playing around in my own mind with the term ‘computer-aided modelling’, or CAM, in homage to the old CAD, computer-aided design, tools. But I agree, the results even of CAM can be very disruptive economically and socially.
I work in deep learning for computer vision – so non-generative AI; it’s the same algorithms but we’re using them for much less open-ended tasks. (Nobody is going to think that my object detectors are “humanly intelligent” any time soon!) The joke used to be that, among scientists, you say you work in machine learning or deep learning, while to investors, you say you work in AI.
Beyond that, I just want to say that part of my reluctance to say that these algorithms are definitely not intelligent is because they’ve already done things that, when I started in the field, I would have assumed required a human intelligence. If I was forced to make a prediction, I would say that I think we’re probably close to the limits of what LLMs are capable of. They will get incrementally better but will continue to e.g. not be able to meaningfully extrapolate. I base this on: yields from scaling pretraining data and model size have almost certainly plateaued, and there is significant but not incontrovertible evidence that reinforcement learning and increased inference compute are not unlocking new scaling regimes the way OpenAI thought they would. However, given that I have *already* been wrong once about where this was all going, I no longer have confidence in my ability to predict the future… I think we should all strive for more epistemic humility on this topic, given how little we understand what is going on inside these models.
I posed the question at hand to ChatGPT — asking for an argument as to why LLMs are *not* equivalent to human intelligence — and the answer it spat out was in line with what others have said. (Sample: “At a basic level, an LLM is a statistical machine trained to predict the next word in a sequence. It learns patterns from enormous amounts of text and uses those patterns to generate fluent responses […] Humans don’t just manipulate symbols; we connect them to lived experience. When a person hears the word “fire,” it’s grounded in sensory memory—heat, danger, warmth, fear, usefulness. An LLM has none of that. Its “knowledge” is ungrounded: the word “fire” is just a cluster of probabilities linking it to other words.”)
At any rate, I think the most salient point is the one Professor Leiter made: that LLMs can give answers that sound like law students but not judges.
Currently I’m working on a rather lengthy project on three very famous 20th century philosophers, trying to draw some links to contemporaneous poetry. Occasionally I’ve posed some questions to ChatGPT, seeing if it can “scoop” my insights. It can’t. As others have pointed out, while ChatGPT can spit out very competent mini essays (eerily competent), these are rather generic in quality. Now I’m not saying that I’m some brilliantly original thinker. It’s just that good scholarship — which I aspire to produce — exhibits a spontaneity and dialectical subtlety that LLMs aren’t (yet) able to generate.
One problem, however, is that good scholarship usually has some generic elements. I mean, let’s say you’re writing for a non-philosophical audience and you want to explain some controversy in the interpretation of, say, Descartes’s philosophy of mind. There’s a good chance an LLM can just spit those paragraphs out for you. So why not just add those paragraphs to your work and flag them as LLM-generated? If nothing else, it’s a good way to save time.
Of course, this has always been an open possibility in academic research. If you’re in need of a few elegant paragraphs introducing Descartes, you *could* just block quote any of the hundreds (or thousands) of academics who’ve done the same. However, I think we tell students to learn how to “put things in their own words” for a reason, one being that you’re not likely to become a competent (human) thinker unless you develop that particular skill.
I must state that I am quite elated that the authors give an affirmative response to whether LLMs now possess human-level intelligence. That is complete intellectual honesty and humility!
Quite honestly, I am wondering why most people badly want the answer to be negative. Recently, an enhanced Google DeepMind Gemini model was judged to have attained a gold-medal level performance in the International Mathematical Olympiad (“IMO”). The IMO is regarded as one of the toughest mathematical competitions requiring the use of novel and clever tricks. In our personal lives, we have been stunned by the capabilities of these models. Personally, I tried ChatGPT on one of the toughest physics problems I was ever faced with in high school- one which only a handful of us could solve. ChatGPT solved it elegantly. Hence, we have ample evidence that LLMs can reason abstractly, learn from examples, explain concepts, use language productively, model beliefs and intentions, and solve novel problems. By nearly all behavioral, functional, and cognitive benchmarks, this qualifies as intelligence. Who cares if they do not have any semantic understanding? What does that matter in the grand scheme of things?
If we are being completely honest with ourselves, and if we choose to completely do away with our anthropocentric biases, then it is high time we accepted that the Turing Test has been conclusively passed. It is also high time we laid to rest Searle’s Chinese Room Argument against Strong AI.
The chief fundamental question should be: Given the increasing explosive capabilities of machine intelligence, what might this mean for our collective humanity? What will this mean in few years for the cherished domains of knowledge, and how will machine intelligence upend what we believe about creativity?
I do believe vehemently that the Turing Test has been conclusively passed, that Searle’s Chinese Room Argument against Strong AI is now in tatters, and that the Word-Sense Disambiguation Problem and Machine Translation Problem may very well be behind us soon.
While I think that Professor Leiter’s point about LLMs reasoning like law students is a valid view, I promise Professor Leiter that LLMs will surmount that problem very soon. The only limits these models face are the most fundamental ones: those stemming from physics and engineering, the limits of computation, etc.
Humans are intelligent! LLMs are intelligent! It is what it is! Case closed!
I note your enthusiasm for LLMs, but this would be more persuasive if it offered arguments in response to the doubts, above, or to Searle’s argument. But let me at least explain why LLMs will not solve the problem I identified. LLMs train only on words. If judges write opinions that do not use words expressing the *real* considerations on which they rely, then LLMs have no access to those considerations. Judges write “formalistic” opinions, i.e., opinions in which only the legally relevant facts and legal doctrines are offered as explicit rationales for their decisions. But the legal realist thesis is that at least in some cases, judges have non-legal reasons for their decisions, e.g., judgments about questions of economic and social policy, or political preferences, or sensitivity to extra-legal norms of other kinds. Since these are nowhere mentioned, there is no way for LLMs to “solve this problem.” Like law students, LLMs resolve legal questions “formalistically.”
Does that mean they’re not intelligent, though? That seems like a consequence of ignorance, not unintelligence.
LLMs now also train on multi-modal data, as others have indicated above. (The post above noting that earlier models’ dependence on CLIP encodings for image generation and understanding has now been obviated is an important point to keep in mind here.) We are really several years past the point where their performance was limited to autoregressive likeliest-next-token prediction given patterns in their training set. Even by the time of ChatGPT a couple of years ago, models were learning from instruction fine-tuning sets, large amounts of reinforcement learning from human feedback, and training on code bases from GitHub; Anthropic’s “Constitutional AI” approach has also taught models to self-critique their completions according to principles in a long “constitution”. Now “large reasoning models” are trained via vast amounts of reinforcement learning from verified rewards, where they can practice “talking” themselves through lots of novel problems and learn to adaptively produce the kind of self-talk that will reliably lead to more correct answers and proofs. These are the systems that achieved gold medal performance on novel proofs in the Math Olympiad challenge (https://www.nature.com/articles/d41586-025-02343-x). It’s very hard to explain these things just in terms of interpolation or text completion.
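For readers unfamiliar with the phrase, here is a highly simplified sketch of the “reinforcement learning from verified rewards” loop (my gloss, not any lab’s actual pipeline; the toy arithmetic task and function names are made up). The key point is that the reward comes from a programmatic checker, not from imitating text in a corpus:

```python
# Toy RLVR loop: sample reasoning traces, verify the final answer, reward correct ones.
import random

def sample_trace(problem):                       # stand-in for the model's sampled "self-talk"
    a, b = problem
    guess = a + b + random.choice([0, 0, 1])     # sometimes right, sometimes off by one
    return f"Let me add {a} and {b}... the answer is {guess}", guess

def verify(problem, answer):                     # the "verified reward": no judge model needed
    return 1.0 if answer == sum(problem) else 0.0

def train_step(problems, samples_per_problem=4):
    batch = []
    for p in problems:
        for _ in range(samples_per_problem):
            trace, ans = sample_trace(p)
            batch.append((trace, verify(p, ans)))
    # A real policy update would go here: raise the probability of high-reward traces.
    return sum(r for _, r in batch) / len(batch)

print("mean reward this step:", train_step([(2, 3), (10, 7)]))
```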
People are also proposing training agents to learn from one another in artificial social simulations. So, if we think for example about the legal reasoning domain, it already seems we’re very far beyond merely replicating ostensible reasoning in publicly available briefs and decisions. Through Constitutional AI, agents could learn to shape their legal reasoning in ways that diverge sharply from reproducing those previously observed patterns by repeatedly using trial-and-error learning to bring their justifications more in line with legal or extra-legal principles stated in a constitution. They could learn to use adaptive self-talk to calibrate multi-step problem-solving and means-end reasoning, considering a variety of different justification strategies before deciding on one. (I’m actually writing papers now that are skeptical we can take their self-talk as faithful and transparent, but it’s clearly doing something to significantly improve performance relevant to generalization on difficult multi-step reasoning problems.) We could even allow them to interact with other agents in simulated legal justification scenarios where they could learn for themselves some of the foibles of merely mimicking publicly observable legal justifications. Things are moving so fast here that it’s hard to have much confidence in any fundamental limitation for more than a week or two, even when one is appropriately grounded in a good understanding of the technology and the right kind of empirical evidence. One could doubtless fine-tune a model along these lines to produce judgments more like a legal realist, if one wanted. (I am definitely not advocating turning things over to LLM judges, for the record!)
My own view FWIW is that they have already mastered some real aspects of generalizable, extrapolative human intelligence, and are still barely started on others that might require things like a messy biological body or lived emotional experience, and we will all just have to come to grips with sharing the planet with uncanny partial agents which are really as powerful or more powerful than us in some ways, but still really unlike us and less powerful than us in others.
This is in reply to Mark’s question: it means they are less intelligent than humans who actually understand what courts are doing. But for reasons given by Cottrell and Maudlin earlier, I do not think it makes any sense to say LLMs are intelligent or think. They are very good mimics of thinking and intelligence, and that makes them useful for certain tasks (and less useful for others).
As I said earlier, I’m on team “intelligence is an incoherent concept”. But let’s try looking at this from an empirical tack. What would convince you that an AI system is intelligent, for some definition of intelligence? Can we set criteria that would be sufficient to say “this model is intelligent, whatever that means”?
If a machine reliably handles every detail, exigency, and nuance of every non-manual job and social interaction then I would not be surprised if many authors lose a substantial source of hesitation about calling it “intelligent” simpliciter. Also, it is a commonplace in many circles that the concept of intelligence is a cluster concept rather than being incoherent.
Mark, if I might ask a question myself. You say that “intelligence is an incoherent concept,” yet elsewhere in the discussion you seem to strongly suggest that LLMs are already intelligent. Which is it? If it is the latter, what would be reasonable to expect of LLM artificial intelligence? Should we expect it to operate along broadly human lines of intelligence, due to some form of convergent evolution, and because human intelligence is the best—indeed the only—model we have for how an advanced, science-building intelligence operates at all?
If the criteria are effectively a pragmatic “I’ll know it when I see it,” what happens if or when artificial intelligence reaches a superhuman level? What scientific laws are then expected to apply, given that such a system would exceed the presumed limits of human intellectual capacity? Even if those limits are poorly defined, there are nonetheless some reasonably assumed constraints on what human minds can and cannot do. If human AI engineers become progressively incapable of explaining what artificial intelligence agents are doing, would science and philosophy risk entering—ironically—a new age of obscurantism, in which superintelligence is no longer scientifically intelligible to us, or is intentionally opaque for reasons of self-interest and security?
I think intelligence is an incoherent concept, but I find the arguments that LLMs are *not* intelligent to be very unconvincing. That’s the reason I’m pushing back.
In the longer run, I’m not 100% sure that superintelligence is possible, even leaving aside whether it’s a coherent term. I’m not sure it isn’t, either, but I wouldn’t be surprised at all if there are diminishing returns to intelligence, however it’s defined.
Frankly, I am very, very concerned about “a new age of obscurantism”. We *already* can’t explain what these things are doing, and they’re not close to superhuman yet. That’s (part of) why I started reading philosophy! I feel I’m coming off in this debate as an LLM booster, which I am definitely *not*, except perhaps by comparison. *I’m trying to stockpile munitions for future wars.*
I use AI to edit stuff I wrote (fiction); I also use a human editor, less often. AI can’t get outside the text and interrogate it even in a simple way; it can’t react like a human does to a piece of fiction, because it lacks experience, a life, and an aesthetic sense. I can supply examples.
It can talk rather intelligently about fiction, but is limited- it’s very uncanny.
Also, moving onto philosophy, I asked Chat (the AI I use most frequently) about the differences between Stoicism and Existentialism and its answer impressed an authority on Stoicism I correspond with who taught me in college.
It can’t do philosophy, it can’t interrogate a piece of fiction as a work of art, and it can’t, as Professor Leiter pointed out, see the law as a judge does.
I don’t see how analyzing the text any better will make a difference – it can’t “think” at these levels listed because it has no mind per se and has no experience- it just has text.
The authors assert that intelligence does not require agency, comparing LLMs to the Oracle of Delphi. This misses a constitutive point: intelligence is not just competence; it is intrinsically goal-directed competence. We call a response “intelligent” only relative to a goal it serves. Crucially, for the intelligence to be attributed to the system itself, the goal-directedness must be intrinsic, not borrowed.
A calculator exhibits arithmetic competence, but its goals are wholly set by the user. Its “intelligence” is derivative. The authors imagine scaling this to a “generally intelligent” but still goal-external system. Yet such a system would remain an instrument, extremely versatile, but still waiting for external instruction about what to do and why. Its flexibility is merely a repertoire of responses; intelligence is what selects and sequences these responses toward an end. That selective, directing capacity is agency.
This points to a fundamental distinction: between artifacts (machines, tools), which are extrinsically or heteronomously goal-directed, and agents, which are intrinsically or autonomously goal-directed. The necessity of agency becomes clear when we consider what “generality” entails. If general intelligence involves competence in novel, open-ended contexts, then in such a context, the relevant task or goal is not pre-specified. Therefore, the system itself must have the capacity to generate and evaluate potential goals, to determine what is worth doing. This capacity for endogenous goal formation is agency. Without it, a system’s flexibility remains latent, a mere menu, until an external agent selects from it. The “general” intelligence would belong to the user, not the tool.
True general intelligence, therefore, is not just broad competence; it is the autonomous capacity to direct that competence. This autonomy is not incidental; it is the very capacity that enables a system to navigate novelty. The Oracle of Delphi analogy fails because the oracle was presumed to possess such an internal (divine) purposiveness; its passivity was a chosen mode of interaction, not a lack of agency. An LLM has no such internal perspective or self-generated ends.
Divorcing intelligence from intrinsic goal-directedness confuses instrumental power with agency. Until LLMs can form and pursue their own ends in an open world, they simulate rather than replicate intelligence. Recognizing this distinction is crucial: it determines whether we are engineering better instruments or creating new kinds of agency/minds.
There is some suggestion from mechanistic interpretability work that LLMs can represent the intentions and goals of both themselves and other agents, and can plan ahead “under the hood” to explore options before producing their completions (https://www.anthropic.com/research/tracing-thoughts-language-model ). LRMs with reasoning traces also explicitly appear to set subgoals, which they can track the completion of and return to if they reach a dead end on another strategy, or think they might have made an earlier mistake (you can even configure them to do more or less of this subgoal-setting behavior to see if it improves performance on certain types of problems: https://arxiv.org/abs/2503.01307 ). Again, it’s possible that this self-generated text does not faithfully represent their true underlying computational processes, but because the text is there and they process it with their own self-attention in context, it provides a way to scaffold goal-directed-like behavior.
Anyway, this is not a fundamental limitation of the technology, but probably more like a product choice. Companies want their models to be predictable once deployed, and so don’t want them developing their own unpredictable goals and intentions in the field. It would be a huge safety risk. But research has investigated agents that can develop their own goals, e.g. Cedric Colas’ work on “autotelic” agents: https://arxiv.org/abs/2305.12487 .