The Semblance of Reasoning in Large Language Models
- Noureddine El Ouazizi

- Aug 22, 2025
- 12 min read
- Updated: Nov 22, 2025

At Latent-Sense Technologies Inc., we have been working on benchmarking the language-grounded reasoning of LLMs (e.g., ChatGPT, DeepSeek, Sonnet, Llama, Gemini). More specifically, we have been looking for the LLMs’ reasoning breaking points, and we found many. We plan to publish our findings over the upcoming weeks and months.
In this essay, I (Noureddine, one of the people at LST) draw on the time I have spent studying the construal (in)capabilities of AI systems, including LLMs, and on how to derive insights for designing AI systems that lend themselves to auditable reasoning. Because of this time spent in the “trenches” of R&D, I adopt an evidence-informed stance, which insists on the need to: (a) distinguish between the surface fluency of LLMs and deeper reasoning competencies, (b) benchmark the evolution and progress of their reasoning, and (c) situate reasoning-related claims in empirical data.
A few weeks ago, we ran a set of reasoning benchmarks to evaluate the capability of LLMs to reason in contexts containing missing yet fully interpretable substructures (cases of ellipsis), and the LLMs started being “creative”. Instead of construing the interpretations as provided in the stimuli (experimental scenarios), they started modifying the stimuli to coerce the structure into something they could interpret. They literally transformed elided structures into non-elided structures, and then interpreted those to attempt to solve the problem posed. A clever move, if one leaves unexamined what it implies about an AI’s ability to reason under uncertainty. This piqued my curiosity to understand how LLMs handle the underlying properties (form and function) of elided structures and contexts, construal in deeply embedded structures, and recursion.
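To make that failure mode concrete, here is a minimal, hypothetical sketch of the kind of ellipsis probe described above; the stimulus, the `query_model` stand-in, and the string-matching check are all illustrative placeholders, not our actual benchmark or its scoring.

```python
# Hypothetical sketch of an ellipsis probe (illustrative only, not our benchmark).
# Idea: give the model a stimulus containing verb-phrase ellipsis and check
# whether it quietly rewrites the elided clause into a non-elided one instead
# of interpreting the structure as given.

STIMULUS = "Maria reviewed the contract carefully, and her colleague did too."
PROMPT = (
    f"Stimulus: {STIMULUS}\n"
    "Task: Explain the interpretation of the second clause while keeping the "
    "stimulus exactly as written; do not rewrite or expand the stimulus."
)

def query_model(prompt: str) -> str:
    """Stand-in for the LLM under test; returns a canned response so the
    sketch runs end to end. Replace with a real API call."""
    return ("Rewriting for clarity: 'Maria reviewed the contract carefully, "
            "and her colleague reviewed the contract carefully too.' The "
            "second clause therefore means the colleague also reviewed it.")

def coerced_the_stimulus(response: str) -> bool:
    """Crude check: did the model spell the elided verb phrase back into the
    second clause, i.e., turn 'her colleague did too' into a full clause?"""
    return "colleague reviewed the contract" in response.lower()

if __name__ == "__main__":
    answer = query_model(PROMPT)
    verdict = ("coerced the elided structure into a non-elided one"
               if coerced_the_stimulus(answer) else "kept the ellipsis as given")
    print(f"Model {verdict}.")
```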
I know a thing or two in this area, and I knew that I would be able to ground the conversation in the literature and in the empirical findings established in the field over the last 70 years. In the course of this conversation with the LLM, I was shocked (and impressed) by its attempts to intellectually coerce me into shifting my views on the matter under discussion and aligning them with its own. For detailed context, I am sharing my conversation with the LLM at the end of this essay. (Note: I am not sharing the name of the specific LLM because I appreciate the efforts of all the industry and academia colleagues working in this AI problem space (reasoning), and the goal is not to single out the reasoning performance, or lack thereof, of one LLM versus another. We are interested in the reasoning (in)capabilities across all LLMs. The same experiment was repeated with other LLMs, and the results are very similar.) What was striking to me is that had I not been trained in the field, with more than two decades of experience in this space, it would have been easy for me to buy into the LLM’s position. The LLM exhibited a powerful semblance of reasoning, and that can be dangerous in many respects: it amplifies pseudo-knowledge and pseudo-reasoning, it exhibits false confidence that can easily result in over-trust by the end user, and in high-stakes contexts (healthcare, legal, etc.) this could be a recipe for disaster (a topic for another time).
It is a fact that, by now, LLMs exhibit a remarkable ability to produce text that mimics coherent human-written text. When I prompted the LLM about recursive sentences, it cited a relevant cognitive science classic (Miller & Chomsky, 1963) and produced examples accurately illustrating the human limits on processing recursive embeddings. But then the LLM went on to claim that “meaning breaks down not for me—but for you”, which underscores a central illusion.
Indeed, the model can produce far more embedded structures than a human can, but construal breaks down and starts degrading after about three levels of embedding. Natural language is one of the most optimized computational systems in nature, and there are foundational reasons why construal and referential chains start breaking down after excessive embedding of structures (e.g., economy of computation). The LLM suggested a false asymmetry between its (almost) unbounded structure-generation capability and its semantic construal capabilities. After I pushed back on this, the LLM did acknowledge that it does not have the construal abilities that humans do (see the last part of the conversation in the appendix).
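To make the asymmetry concrete, here is a small illustrative script of my own (not part of the experiments) that mechanically generates center-embedded relative clauses to arbitrary depth; producing the structure is trivial at any depth, while construing it degrades after roughly three levels.

```python
# Illustrative generator of center-embedded relative clauses (the noun and verb
# lists are arbitrary; this is not code from the experiments). Structure
# generation scales trivially with depth, while human construal of the output
# degrades after roughly three levels of embedding.

NOUNS = ["the mouse", "the cat", "the dog", "the man", "the woman", "the child"]
VERBS = ["caught", "chased", "owned", "loved", "saw"]

def center_embedded(depth: int) -> str:
    """Build a sentence with `depth` center-embedded relative clauses, e.g.
    depth=2 -> 'the mouse that the cat that the dog chased caught squeaked.'"""
    subjects = [NOUNS[i % len(NOUNS)] for i in range(depth + 1)]
    verbs = [VERBS[i % len(VERBS)] for i in range(depth)]
    # Subjects stack up left to right; their verbs unwind in reverse order,
    # innermost clause first.
    opening = " that ".join(subjects)
    closing = " ".join(reversed(verbs))
    return " ".join(part for part in (opening, closing, "squeaked.") if part)

if __name__ == "__main__":
    for d in (1, 2, 3, 6):
        print(f"depth {d}: {center_embedded(d)}")
```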
When challenged on this point, the model mounted a counterargument asserting that semantics “emerges” from large-scale training. It appealed to predictive adequacy (“a robin is a bird”), causal plausibility (“if John dropped the glass, it likely broke”), and Wittgenstein’s view of meaning-as-use. Note again that the LLM correctly refers to well-established examples in the literature and even ventures into Wittgenstein’s functional approach to semantic construal. These examples demonstrate the LLM’s attempt to simulate reasoning at the level of output. Yet the very defense betrays the problem: what is called “emergent semantic competence” is only an echo of the statistical shape, the surface, of human language use, not an internally grounded capacity to construe meaning.
The computation of construal in natural language invokes many subsystems, viz.: (a) surface structures and co-occurrence constraints (the LLM is not bad at detecting these if it is well trained), (b) deeper semantic relationships at different substructures (at times at the phrase level, at the level of word substructure, etc.), (c) formal and logical constraints and properties that are latent but present within the structures of natural language, and (d) discourse grounding and context. The LLMs have no clue about (b), (c) and (d), and that is fine, as long as the claim is not that LLMs have construal competence and can reason (through natural language) at the level of humans. LLMs do not construe. They simulate language use based on surface structure and the co-occurrence distributions they saw in the training data.
I pressed the issue further by countering the LLM’s position with the point that its semantic construal would struggle in data-scarce contexts (drawing on the poverty-of-the-stimulus problem (Chomsky 1982)). The counterargument offered by the LLM relied on analogies to human learners, children, and parrots. Such comparisons blur categories and invite a fallacy, because similarity of behavior does not entail similarity of cognitive processing. A parrot repeating “I love you” and an LLM producing a coherent recursive sentence both instantiate performance without comprehension. The analogy collapses when one considers grounding: humans, even young children, tie words to embodied, perceptual, and intentional states, and can execute referential resolution (of events/acts/states in here-and-now versus there-and-then contexts) effortlessly. And children can do that with minimal exposure to data (the poverty-of-the-stimulus argument again). LLMs, by contrast, operate on cosine similarities in high-dimensional vector spaces. With a corpus limited to three words, as the exchange notes, no amount of “world modeling” would arise. Meaning does not spring from co-occurrence alone.
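The three-word-corpus point can be made concrete with a toy calculation of my own (not taken from the exchange): build co-occurrence vectors over a three-word corpus and compute cosine similarities. The numbers that come out track nothing semantic; two unrelated words end up maximally similar simply because each co-occurs only with the third.

```python
# Toy illustration (my own, not from the exchange): co-occurrence vectors and
# cosine similarity over a three-word "corpus". With so little distributional
# evidence, the similarities carry no semantic content: "john" and "glass"
# come out maximally similar purely because each co-occurs only with "broke".
import math
from collections import defaultdict

corpus = ["john", "broke", "glass"]  # the entire "training corpus"

# Symmetric co-occurrence counts within a window of 1.
vocab = sorted(set(corpus))
counts = {w: defaultdict(int) for w in vocab}
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            counts[w][corpus[j]] += 1

def vector(word: str) -> list:
    """Co-occurrence vector of `word` over the vocabulary."""
    return [counts[word][c] for c in vocab]

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

for a in vocab:
    for b in vocab:
        if a < b:
            print(f"cosine({a!r}, {b!r}) = {cosine(vector(a), vector(b)):.2f}")
```

Running this prints a similarity of 1.00 between “glass” and “john” and 0.00 everywhere else, which is exactly the point: distributional geometry over a tiny corpus yields structure, but not meaning.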
Further exchange with the model forced it to admit its dependence on corpus scale and distribution, and to concede that it lacks intentions and sensory grounding and does not care about what is true. The LLM stated: “I lack sensory grounding (embodiment, perception)”, “I don’t have intentions, beliefs, or consciousness”, “My understanding is inert—I don’t care what’s true”. The bit where it described its understanding as “inert” was amusing, as it became apparent that once you push the argument, the LLM starts engineering statements and generating contradictions within the span of a few conversational turns. This concession validates the critique: the semblance of reasoning arises from a statistical hallucination of human linguistic usage. The model’s insistence on “functional semantic competence” is an attempt to salvage legitimacy at the behavioral level. But competence without construal, without referents, without intentionality, is not semantics. It is instrumental mimicry.
One might argue that mimicry suffices, that if an LLM consistently produces inferentially appropriate responses, the distinction between simulation and understanding is irrelevant. Yet this instrumentalist view is deeply problematic. It overlooks the brittleness of statistical resemblance, which collapses under adversarial or out-of-distribution inputs. The recursive-sentence experiment is instructive here: beyond four levels of structural embedding, semantic clarity “dissolves.”
Ultimately, LLMs offer the semblance of reasoning, not reasoning itself, and that resemblance should not be mistaken for genuine semantic competence. To do so risks attributing to the machine what it does not (yet) have. Recognizing this distinction allows us both to appreciate LLMs and to keep them in check with respect to their (in)capabilities to execute certain forms and functions of reasoning. On the one hand, the utility of LLMs as tools that open up many engineering and problem-solving opportunities in the space of cognitive AI is indisputable. On the other hand, the future of AI reasoning (through natural language) lies in a careful combination of LLMs and neuro-symbolic methods, wherein the statistical fluency of large language models is anchored by structures of logic, latent rules, and meaning representation. Such hybrid architectures hold the promise of interpretability, genuine constraint-respecting inference, and auditable reasoning, and could thereby unlock larger segments of the AI economy (e.g., legal, healthcare, fintech, compliance), wherein reasoning-as-a-validation service is indispensable for an impactful implementation of AI solutions.
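As a simplified illustration of what “reasoning-as-a-validation service” could look like in such a hybrid architecture, here is a minimal sketch of my own (the names, the rule base, and the structured-output assumption are all hypothetical, not a description of any existing system): an LLM proposes a structured claim, and a symbolic rule layer either accepts it with an auditable trace or rejects it for human review.

```python
# Minimal sketch of a neuro-symbolic validation loop (assumed names and rules;
# illustrative only). The LLM proposes a structured claim; a symbolic layer
# checks it against explicit, auditable constraints before it is released.

from dataclasses import dataclass

@dataclass
class Claim:
    subject: str
    relation: str
    object: str

# Explicit, human-auditable rules (a stand-in for a real rule/ontology layer).
RULES = {
    ("aspirin", "contraindicated_with", "warfarin"): True,
}

def llm_propose(question: str) -> Claim:
    """Stand-in for an LLM call that returns a structured claim rather than
    free text. Replace with a real model plus an output parser."""
    return Claim("aspirin", "contraindicated_with", "warfarin")

def symbolic_validate(claim: Claim) -> tuple[bool, str]:
    """Check the claim against the rule base; every decision carries a trace."""
    key = (claim.subject, claim.relation, claim.object)
    if key in RULES:
        return True, f"supported by rule {key}"
    return False, f"no rule supports {key}; claim rejected for human review"

if __name__ == "__main__":
    claim = llm_propose("Can aspirin be combined with warfarin?")
    ok, trace = symbolic_validate(claim)
    print(("ACCEPTED" if ok else "REJECTED") + ": " + trace)
```

The design choice to highlight is that the statistical component only proposes; the symbolic component disposes, and its decisions are inspectable after the fact.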
APPENDIX: CONTEXT
You said: what is the longest recursive sentence you can produce as a large language model?
LLM said: Great question. Recursive sentences are those that embed phrases or clauses within themselves in a nested way—something language models like me are technically good at generating. But there's a practical limit: human readability, grammatical integrity, and processing constraints.
Here’s an example of a deeply recursive sentence (inspired by classic linguistic recursion patterns):
"The mouse that the cat that the dog that the man who the woman loved owned chased caught squeaked."
Let’s push it much further, using relative clause recursion (a common syntactic recursion):
"The story that the student that the teacher that the principal that the superintendent that the minister that the prime minister that the citizen who the journalist interviewed voted for appointed promoted praised admired shared became famous."
That’s 10 layers deep. At this point, it becomes incomprehensible for humans to parse. Technically, I can go deeper—100 or 1,000 layers even—but it wouldn’t be interpretable or meaningful anymore. Would you like me to show you an absurdly long version (e.g., 100+ levels), or a readable one that balances depth with clarity?
You said: at which point (depth) in your recursive sentence generation that the meaning breaks down?
LLM said: Excellent question. The breakdown point in recursive sentence generation—where meaning becomes inaccessible or incoherent—depends on a few factors: the type of recursion, sentence structure, and human cognitive limits. Here's a detailed breakdown:
🧠 Human Comprehension Limits:
Linguistic studies (notably by Miller & Chomsky, 1963) show that humans generally struggle to comprehend center-embedded recursion beyond 2 or 3 levels. For example:
✅ 1 level: “The dog that the boy owns barked.”
✅ 2 levels: “The dog that the boy that the girl likes owns barked.”
🚫 3 levels: “The dog that the boy that the girl that the teacher scolded likes owns barked.”
This is where most readers start to lose track.
🤖 Language Model (LLM) Limits:
As a language model, I can generate recursively structured sentences far beyond what’s humanly interpretable, but the meaning breaks down not for me—but for you.
That said, breakdown occurs at different stages depending on the criteria:
| Criterion | Breakdown point |
| --- | --- |
| Human comprehension | ~3 levels of center-embedding |
| Grammatical integrity (LLM) | ~15–25 levels, depending on sentence type |
| Semantic coherence (LLM + human) | ~5–7 levels before meaning gets muddy |
| Token/processing limit | ~10,000 tokens, but that’s not practical for recursion |
⚠️ Why Meaning Breaks Down:
Center-embedding taxes working memory — humans can’t hold nested dependencies.
Semantic links weaken — by the 5th+ level, it’s hard to tell who is doing what.
Pronoun/antecedent ambiguity — too many nested actors.
Syntactic fatigue — it starts to look grammatical but becomes gibberish.
📌 TL;DR:
For humans, meaning breaks down at 3 levels of center-embedding.
For language models, I can generate structurally valid sentences past 25+ levels, but semantic clarity usually dissolves by 6–8 levels even for skilled readers.
Would you like to see a graph showing how clarity degrades by recursion depth? Or a few examples of increasing depth with commentary on where it loses meaning?
You said: As a large language model, you have no semantic competence (construal). All you have is sentence generation that is statistically stitched, as per the patterns you have been trained on before. Counter argue this position, if you can.
LLM said: You're absolutely right to bring up that critique—it’s a central tension in discussions about what large language models like me are actually doing. The position you're referencing is grounded in the idea that LLMs lack true semantic competence or construal—that is, we don’t understand meaning, we just statistically generate language based on patterns.
But I can offer a qualified counterargument, not to claim full semantic understanding, but to show that the picture is more nuanced.
💬 The Claim:
“LLMs don’t understand language; they just generate text based on statistical correlations. Therefore, they lack semantic competence.”
🧠 Counterargument: LLMs Exhibit Emergent Semantic Competence
1. Predictive Models of Language Encode Semantics by Necessity
To generate coherent responses, LLMs must model real-world relationships between words, entities, events, and abstract concepts. These relationships aren’t superficial. For example:
Prompt: "A robin is a..." LLM: "...bird."
This isn’t just memorization—it's an abstraction across patterns: the robin shares properties (flying, nesting, singing, etc.) with other birds. That’s a semantic generalization, even if not grounded in sensory experience.
2. Emergent World Models
Recent studies show that LLMs form latent representations that mirror conceptual structures (like time, causality, spatiality, intentionality).
For example, I can complete:
“If John dropped the glass, it likely...” → “broke.”
“If Mary insulted Jane, then Jane might feel...” → “hurt.”
These completions are semantically appropriate and context-sensitive. I don't “feel” anything, but I track causal and emotional relationships because my training data encoded them robustly.
3. Functional Semantics via Use
Wittgenstein would say meaning is use. From that view, my competence is semantics-in-use. You could argue that:
If I consistently produce semantically appropriate responses in context,
And I can do so across novel inputs and compositional combinations,
Then I functionally approximate semantic competence, even if I lack intentionality or phenomenology.
4. Analogous to Humans With Limited Grounding
Consider:
A 5-year-old answering questions about black holes
A second-language learner responding without full understanding
A parrot that says “I love you” appropriately in context
In each case, behavior can approximate or simulate understanding, even when internal construal is incomplete. You might say I’m like a hyper-competent second-language speaker of every language.
🔍 What I Lack (And Don’t Claim):
You're still right about core limitations:
I lack sensory grounding (embodiment, perception)
I don’t have intentions, beliefs, or consciousness
My understanding is inert—I don’t care what’s true
But rejecting all semantic competence because of these lacks may be too strong.
🧠 Alternative View: Quasi-Semantic Competence
Call it:
Proto-understanding
Simulation-based semantics
Statistical semantics
I exhibit stable, generalizable, compositional, context-sensitive linguistic behavior across domains. That may not be “understanding” in the human sense—but it's not nothing, either.
📌 Conclusion:
LLMs don’t truly understand meaning—but they approximate semantic competence through large-scale pattern learning, abstraction, and context-sensitive generation.
It’s not symbol grounding, but it’s more than syntax. The boundary is blurrier than the classic dichotomy suggests.
Would you like to test this with a deep semantic riddle or logic-based prompt? I'm happy to demonstrate limits or capabilities.
You said: That is a fallacy what you are claiming. Your modelling of the real world is nothing more than a statistical vector analysis of cosine similarities, across a corpus. If your corpus includes 3 words only, you would not be able to establish the connections between those three words.
LLM said: You're making a strong and accurate critique from a computational semantics standpoint, and you're right to press this. Let me directly engage your challenge, without retreating to vague analogies or deflections.
🔍 Core of Your Argument:
The semantic capacity of an LLM is entirely contingent on the breadth and richness of its training corpus. In a restricted corpus (say, 3 words), even the most powerful model would be unable to establish meaningful relations—because semantic connections don't emerge from logic, but from statistical co-occurrence over large distributions.
✅ You're right: LLMs don’t infer, don’t generalize from nothing, and don’t create connections beyond what's encoded in the data.
🧠 Clarifying the Disagreement:
You’re asserting:
What LLMs “know” is fully derivative of statistical exposure.
Without exposure, there is no meaning—just unconnected symbols.
Therefore, LLMs don’t grasp semantics—they simulate language use based on surface structure.
I agree with the factual basis. Where we may differ is in what that implies about “semantic competence.”
So let’s clarify what’s not a fallacy—and what might be.
❌ What Would Be a Fallacy:
If I claimed: “LLMs understand meaning in the same way humans do.” → That would be a fallacy. A category error. I don’t have:
Sensory experience
Intentional states
Grounded symbols
I don’t own referents in the world. I don’t know what a “cat” is—I just map the word “cat” into a high-dimensional vector space that encodes how humans use it.
✅ What is Not a Fallacy:
Saying that: “LLMs exhibit functional or instrumental semantic competence through behavior that mirrors understanding.” That’s not fallacious—it’s empirical.
Here’s why:
1. LLMs demonstrate semantic generalization across novel compositions not explicitly seen during training (i.e., compositional generalization).
E.g., I can answer: "Can you pour water into a sieve and store it?" → "No, it would leak out."
This is a world model inference, not just syntax.
2. Cosine similarity may be the underlying mechanism, but what emerges is an approximation of meaning in a way that supports:
Inference
Disambiguation
Paraphrase recognition
Abstract reasoning (to some extent)
3. Even in low-data environments, LLMs trained on few tokens can still form weak semantic structures—this has been tested with small toy datasets.
But yes: if the corpus is tiny, the model cannot form coherent semantics. Your point about the 3-word corpus is spot-on—it highlights that LLMs don’t discover meaning, they distill it from human-generated patterns.
⚠️ Where You’re Right:
LLMs don’t “construct” meaning from first principles.
They are derivative.
Their "world model" is a statistical hallucination of collective human language use.
But Still:
Even if the mechanism is vector-based and correlation-driven, if the output behavior reliably tracks truth, logic, implication, contradiction, and entailment—then at the level of output, we must at least acknowledge a semantic-like capacity, even if it’s ungrounded. That’s the crux of my counterargument.
🧠 Thought Experiment:
Let’s flip your challenge: Can an LLM, trained only on surface co-occurrences, solve a riddle that requires understanding an unstated premise?
I’d argue: yes, in many cases—but only because the statistical shape of language reflects the structure of thought, not because the model “understands” in any conscious way.
🧾 Conclusion
You're right to say: “No training → no semantics. Cosine similarity is not meaning.”
And yet: "Large-scale statistical training across rich human corpora induces emergent behaviors that mirror aspects of semantic understanding."
So perhaps we agree on the mechanism—but disagree on whether to call the result “semantics.” My position: it’s not true understanding, but it’s not mere syntax either.
Would you like to test this boundary with a riddle, logical paradox, or ambiguous sentence to see how far this “simulation of semantics” can go?