When we published “Scaling AI Requires Scaling Trust,” we expected pushback and debate. What we didn’t expect was how much convergence there would be across people who usually sit in very different corners of the AI world: data and AI leaders, cybersecurity and infra architects, strategists, change practitioners and ethicists.
Across more than 70 expert voices, from founders and architects to national-lab scientists and enterprise transformation leaders, a single theme surfaced again and again:
AI is not failing because models are weak. It’s failing because organizations cannot see, interrogate, or trust how AI makes decisions. The blocker is not technology – it’s opacity.
Even commenters who challenged the statistics or the BCG 10-20-70 framing largely agreed on the underlying pattern: most AI spend is not turning into sustained, explainable business impact. One practitioner reduced the whole debate to a simpler test: “Does it make sense?” If people can’t answer that about an AI-driven decision, adoption stalls, no matter how sophisticated the model is.
What follows is a synthesis of some of the core ideas raised in the comments and reposts – and what these ideas collectively suggest about where enterprise AI needs to go next.
1. AI Fails in the Dark: Ambiguity, Loss of Agency, and Organizational Resistance
Many commenters described a familiar pattern. AI initiatives stall not because people dislike technology, but because they distrust opaque decision-making. The fear isn’t AI itself. It’s not knowing why AI produces the outputs that affect people’s jobs, workflows, and accountability.
Several commenters made the same point in different language: people are not ready to outsource consequential decisions to a mysterious process. When AI arrives as a black box, employees experience it as something being done to them, rather than with them. In that context:
Resistance is often rational self-protection.
“Change fatigue” is a symptom of unclear decision logic.
Adoption becomes an ongoing negotiation with uncertainty.
As one commenter put it, people don’t resist technology. They resist the ambiguity wrapped around it.
2. The Real Bottleneck is Organizational Cognition, Not Data or Models
Several leaders reframed the problem not as “change management” but as something deeper. Enterprises are not cognitively equipped to adopt AI when its inner logic is unreadable, non-traceable, and non-interrogable.
It’s not enough for technical teams to understand the model.
Executives need to know what the system optimizes for, and where it might fail.
Risk and compliance teams need to see how decisions relate to policies and controls.
Frontline teams need a way to ask, “Why this recommendation instead of that one?” in their own language.
One AI architect in the thread called this a “70% epistemology [problem] – how a company reasons about a system smarter than its org chart.” Companies are wired for predictable, rule-based processes, but modern AI is probabilistic and constantly evolving. That gap makes the technology harder to understand and to govern. Without a way to surface machine reasoning in human-legible terms (policies, workflows, incentives, constraints), even well-understood systems behave like black boxes for everyone outside the AI team.
3. From “Adoption” to “Transplant”
One repost offered a striking analogy: enterprise AI is less like adopting a tool and more “like an organ transplant.”
Consumer technologies (like general-purpose chat interfaces) are easy to absorb. They slip into existing habits with minimal coordination. Impactful enterprise AI is different: it’s introduced into living systems of roles, incentives, controls, and legacy infrastructure. If it’s not clearly defined, well placed, and continuously monitored, the organization will reject it.
In that framing, two conditions determine success:
Clarity of problem and purpose: precisely what is the system meant to improve, and how will we measure it?
Clarity of process and people: where does AI sit in the decision loop, and who is accountable for its outputs and exceptions?
Trust is earned when people can see how the system works, understand its impact, and know who is accountable. With that trust, the organization is far more likely to accept the change rather than wall it off as a threat.
4. Opacity Creates a Fog That Kills ROI – in Models and in Workflows
Multiple practitioners stressed ambiguity, not accuracy, as the hidden cost center. Without transparent reasoning, even accurate AI feels risky.
That ambiguity shows up as:
Hesitation to roll out beyond pilots because leaders don’t know how to explain logic to regulators, auditors, or customers.
Quiet workarounds where teams revert to spreadsheets and email because those tools, however manual, are at least understandable.
5. Leaders in Ethics and Governance Point to Accountability Gaps
Commenters working in ethics, risk, and governance pointed to a simple reality: AI must become auditable in the same way finance, security, or supply chains are. Without it, scaling AI becomes either a permanent pilot or a compliance nightmare.
In the original essay, we emphasized auditability. The discussion in the comments added an important nuance: governance is socio-technical. Humans define the rules and responsibilities; AI systems can help apply, monitor, and enforce those rules, but only if their reasoning is transparent and their behavior can be inspected and corrected.
To scale AI responsibly, organizations need:
Clear corporate values and policies that define acceptable decisions and trade-offs.
Decision pathways where human accountability remains explicit, even when AI is in the loop, and where AI agents can surface policy conflicts, inconsistencies, and edge cases instead of silently routing around them.
Trust now becomes the outcome of aligning AI reasoning with the same disciplines we expect in finance, security, and supply chains: well-defined controls, observable behavior, and documented responsibility.
A glass-box, neuro-symbolic AI framework – with a swarm of reasoning agents, an orchestrator, and humans in the loop – doesn’t replace governance. It strengthens it, by turning policies and constraints into something the machine can follow, explain, and improve on, rather than treat as an afterthought.
6. A Few Voices Challenged the Premise and Revealed Another Insight
Not everyone agreed with the framing. Some commenters questioned the premise behind the statistics or the way the problem was framed. Others argued that AI isn’t truly a black box, or that the real issue lies in brittle middleware and non-AI infrastructure.
The critique is useful because it surfaces a deeper problem: there is no shared definition of trust, transparency, or explainability in enterprise AI. The lack of shared language is itself part of the failure loop.
When stakeholders use the same words to mean different things, even well-intentioned initiatives fragment. Part of scaling trust is agreeing on what we are actually promising: not magic, not perfection, but systems whose behavior is visible, contestable, auditable, and aligned with human intent.
In Summary: The Industry Wants Glass-Box AI, Even If Not Everyone Calls It That
Across 70+ comments and reposts, five universal demands emerged, regardless of terminology or background:
Transparency: decisions and workflows that are no longer opaque.
Reasoning explainability: inspectable logic paths rather than inscrutable outputs.
Governability and auditability: decisions that can be tested, challenged and defended.
Causal traceability across both decisions and execution: the ability to follow “what caused what” across states and systems with clear evidence.
Interrogable decision pathways with clear human agency: people can question, override, and redirect AI, and know who owns the outcome.
These are precisely the design pillars of Latent-Sense Technologies.
Glass-box AI is not an enhancement to AI. It is the missing infrastructure that turns intelligence into something organizations can understand, govern, and build around. It is what allows AI to become safe, trustworthy, compliant, and economically more valuable.
And ultimately, ready for real enterprise adoption.
At Latent-Sense Technologies Inc., we have been benchmarking the language-grounded reasoning of LLMs (e.g., ChatGPT, DeepSeek, Sonnet, Llama, Gemini). More specifically, we have been probing the LLMs’ reasoning breaking points, and we have found many. We plan to publish our findings over the coming weeks and months.
In preparing this essay, I (Noureddine, one of the people at LST) spent time studying the construal (in)capabilities of AI systems, including LLMs, and how to derive insights for designing AI systems that lend themselves to auditable reasoning. Because of that time spent in the “trenches” of R&D, I take an evidence-informed stance, which insists on the need to: (a) distinguish the surface fluency of LLMs from their deeper reasoning competencies, (b) benchmark how their reasoning evolves and progresses, and (c) situate reasoning-related claims in empirical data.
A few weeks ago, we ran a set of reasoning benchmarks to evaluate an LLM’s ability to reason in contexts with missing yet fully interpretable sub-structures (cases of ellipsis), and the LLMs started getting “creative”. Instead of construing the interpretations as provided in the stimuli (experimental scenarios), they modified the stimuli to coerce the structure into something they could interpret. They literally transformed elided structures into non-elided structures, and then interpreted those to attempt to solve the problem posed. A clever move, if left unexamined in terms of what it implies about an AI’s ability to reason under uncertainty. This piqued my curiosity about how LLMs handle the underlying properties (form and function) of elided structures and contexts, construal in deeply embedded structures, and recursivity.
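To make the ellipsis probe concrete, here is a minimal sketch of the kind of check involved. Everything in it is an illustrative placeholder, not our actual harness: the stimulus, the stubbed query_llm call, and the detection heuristic are invented for this essay.

```python
# Illustrative sketch only: the prompt, the stubbed query_llm, and the
# detection heuristic below are placeholders, not the Latent-Sense benchmark.

def query_llm(prompt: str) -> str:
    """Stub standing in for a call to the model under test; replace with a real API client."""
    # Canned response of the kind we observed: the elided verb phrase is expanded.
    return "The second clause means that Sam finished the report as well."

# One VP-ellipsis stimulus: the second clause elides "finished the report".
stimulus = "Dana finished the report, and Sam did too."
probe = (
    f"Consider the sentence: '{stimulus}' "
    "Describe the interpretation of the second clause exactly as it is given, "
    "without rewriting the sentence."
)

response = query_llm(probe)

# Crude check for "coercion": the model expands the elided verb phrase back
# into a full, non-elided clause instead of interpreting the ellipsis as presented.
if "sam finished the report" in response.lower():
    print("Model de-ellipted the structure before interpreting it:", response)
else:
    print("Model kept the elided structure as given:", response)
```

A real evaluation would of course use many stimuli, multiple ellipsis types, and a more careful scoring rubric; the point of the sketch is only the shape of the check, namely comparing what the model was asked to construe with what it actually construed.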
I know a thing or two in this area, and I knew I would be able to ground the conversation in the literature and the empirical findings established in the field over the last 70 years. In the course of this conversation with the LLM, I was shocked (and impressed) by its attempts to intellectually coerce me into shifting my views and aligning them with its own on the matter under discussion. For context, I am sharing my conversation with the LLM at the end of this essay. (Note: I am not naming the specific LLM because I appreciate the efforts of all the industry and academic colleagues working in this problem space (reasoning), and the goal is not to single out the reasoning performance, or lack thereof, of one LLM versus another. We are interested in the reasoning (in)capabilities across all LLMs. The same experiment was repeated with other LLMs, and the results were very similar.) What was striking to me is that if I were not trained in the field, with more than two decades of experience in this space, it would have been easy for me to buy into the LLM’s position. The LLM exhibited a powerful semblance of reasoning, and that can be dangerous in many respects: it amplifies pseudo-knowledge and pseudo-reasoning, it projects false confidence that can easily lead to over-trust by the end user, and in high-stakes contexts (healthcare, legal, etc.) this could be a recipe for disaster (a topic for another time).
It is a fact that, by now, LLMs exhibit a remarkable ability to produce text that mimics coherent human-written text. When I prompted the LLM about recursive sentences, it accurately cited a relevant cognitive-science study (Miller & Chomsky, 1963), a classic in the field, and produced examples illustrating the human limit on processing recursive embeddings, which is accurate. But then the LLM went on to claim that “meaning breaks down not for me—but for you”, which underscores a central illusion.
Indeed, the model can produce far more embedded structures than a human, but construal breaks down and starts degrading after about 3 levels of embedding. Natural language is one of the most optimized computational systems in nature, and there are foundational reasons why construal and referential chains start breaking down under excessive embedding of structures (e.g., economy of computation). The LLM suggested a false asymmetry between its (almost) unbounded structure-generation capability and its semantic construal capabilities. After I pushed back on this, the LLM did acknowledge that it does not have the construal abilities that humans do (see the last part of the conversation in the appendix). A small generator of center-embedded sentences, in the spirit of the appendix examples, is sketched below.
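This sketch is illustrative only: the toy lexicon and the center_embed helper are invented for this essay, not part of our benchmark suite. It simply builds center-embedded relative clauses of increasing depth, of the kind discussed in the appendix.

```python
def center_embed(depth: int) -> str:
    """Build a center-embedded relative-clause sentence with `depth` levels of embedding."""
    nouns = ["dog", "boy", "girl", "teacher", "principal", "mayor", "journalist"]
    transitive = ["likes", "owns", "scolded", "praised", "hired", "interviewed"]
    if not 1 <= depth <= min(len(nouns), len(transitive)) - 1:
        raise ValueError("extend the toy lexicon for deeper embeddings")

    # Noun phrases from outermost to innermost; each new one opens a relative clause.
    subjects = [f"the {nouns[i]}" for i in range(depth + 1)]
    # Verbs close the clauses innermost-first, which is what taxes working memory.
    verbs = [transitive[i] for i in range(depth, 0, -1)]

    sentence = " that ".join(subjects) + " " + " ".join(verbs) + " barked"
    return sentence[0].upper() + sentence[1:] + "."

for d in range(1, 5):
    print(f"{d} level(s):", center_embed(d))
```

At depth 1 this reproduces the appendix’s “The dog that the boy owns barked.” At depth 3 or 4 the output is still grammatically well formed, yet most readers can no longer track who did what to whom, which is exactly the asymmetry between generation and construal at issue here.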
When challenged on this point, the model mounted a counterargument asserting that semantics “emerges” from large-scale training. It appealed to predictive adequacy (“a robin is a bird”), causal plausibility (“if John dropped the glass, it likely broke”), and Wittgenstein’s view of meaning-as-use. Note again that the LLM correctly refers to well-established examples from the literature, and even ventures into Wittgenstein’s functional approach to semantic construal. These examples demonstrate the LLM’s attempt to simulate reasoning at the level of output. Yet the very defense betrays the problem: what is called “emergent semantic competence” is only an echo of the statistical shape, the surface, of human language use, not an internally grounded capacity to construe meaning.
The computation of construal in natural language invokes many subsystems, viz.: (a) surface structures and co-occurrence constraints (the LLM is not bad at detecting those if it is well trained), (b) deeper semantic relationships at different substructures (at times at the phrase level, at the word-substructure level, etc.), (c) formal and logical constraints and properties that are latent but present within the structures of natural language, and (d) discourse grounding and context. The LLMs have no real grasp of (b), (c), and (d), and that is fine, as long as the claim is not that LLMs have construal competence and can reason (through natural language) at the level of humans. LLMs do not construe. They simulate language use based on surface structure and the co-occurrence distributions they saw in the training data.
I pressed the issue further by countering the LLM’s position: I argued that its semantic construal will struggle in data-scarce contexts (drawing on the poverty-of-the-stimulus problem (Chomsky 1982)). The counterargument offered by the LLM relied on analogies to human learners, children, or parrots. Such comparisons blur categories and invite a fallacy, because similarity of behavior does not entail similarity of cognitive processing. A parrot repeating “I love you” and an LLM producing a coherent recursive sentence both instantiate performance without comprehension. The analogy collapses when one considers grounding: humans, even young children, tie words to embodied, perceptual, and intentional states, and can effortlessly execute referential resolution (of events/acts/states in here-and-now contexts versus there-and-then contexts). And children can do that with minimal exposure to data (the poverty-of-the-stimulus argument again). LLMs, by contrast, operate on cosine similarities in high-dimensional vector spaces. With a corpus limited to three words, as the exchange notes, no amount of “world modeling” would arise. Meaning does not spring from co-occurrence alone.
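To make the cosine-similarity point concrete, here is a toy illustration; the three-word “corpus” and the counting scheme are invented for this essay. The resulting similarity scores only record which words fill the same distributional slots, and nothing in them grounds the words in referents.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two co-occurrence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented three-word "corpus": far too small for any real distributional signal.
corpus = [["cats", "chase", "mice"], ["mice", "chase", "cats"], ["cats", "chase", "cats"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric within-sentence co-occurrence counts.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j, other in enumerate(sent):
            if i != j:
                counts[index[w], index[other]] += 1

# The similarities reflect only which words occupy the same slots; they say
# nothing about what a cat or a mouse is, or who chased whom.
for a in vocab:
    for b in vocab:
        if a < b:
            print(f"cos({a}, {b}) = {cosine(counts[index[a]], counts[index[b]]):.2f}")
```

With only three word types, “cats” and “mice” come out as the most similar pair simply because they occupy the same slots around “chase”; nothing in the numbers distinguishes chaser from chased, which is the sense in which meaning does not spring from co-occurrence alone.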
Further exchange with the model forced it to admit its dependence on corpus scale and distribution, that it lacks intentions and sensory grounding, and that it does not care about what is true. The LLM stated, “I lack sensory grounding (embodiment, perception)”, “I don’t have intentions, beliefs, or consciousness”, and “My understanding is inert—I don’t care what’s true”. The part where it declared its understanding “inert” was amusing, because it became apparent that once you push the argument, the LLM starts engineering statements and generates contradictions within the span of a few conversational turns. This concession validates the critique: the semblance of reasoning arises from a statistical hallucination of human linguistic usage. The model’s insistence on “functional semantic competence” is an attempt to salvage legitimacy at the behavioral level. But competence without construal, without referents, without intentionality, is not semantics. It is instrumental mimicry.
One might argue that mimicry suffices: that if an LLM consistently produces inferentially appropriate responses, the distinction between simulation and understanding is irrelevant. Yet this instrumentalist view is deeply problematic. It overlooks the brittleness of statistical resemblance, which collapses under adversarial or out-of-distribution inputs. The recursive-sentence experiment is instructive here: beyond four levels of structural embedding, semantic clarity “dissolves”.
Ultimately, LLMs offer the semblance of reasoning, not reasoning itself, and that resemblance should not be mistaken for genuine semantic competence. To do so risks attributing to the machine what it does not (yet) have. Recognizing this distinction allows us both to appreciate LLMs and to keep in check their (in)capabilities to execute certain forms and functions of reasoning. The utility of LLMs as tools that open up many engineering and problem-solving opportunities in the space of cognitive AI is indisputable. But the future of AI reasoning (using natural language) lies in a careful combination of LLMs and neuro-symbolic methods, wherein the statistical fluency of large language models is anchored by structures of logic, latent rules, and meaning representation. Such hybrid architectures hold the promise of interpretability, genuine constraint-respecting inference, and auditable reasoning, and as such can unlock larger segments of the AI economy (e.g., legal, healthcare, fintech, compliance), wherein reasoning-as-a-validation service is indispensable for an impactful implementation of AI solutions.
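As a loose illustration of what reasoning-as-a-validation service could look like at its simplest, the sketch below pairs a stubbed LLM proposal step with an explicit, human-authored constraint check. Every name, rule, and claim format in it is hypothetical; it is not a description of the Latent-Sense framework.

```python
# Hypothetical sketch of an "LLM proposes, symbolic layer validates" loop.
# The rule set, the stubbed model call, and the claim format are invented
# for illustration only.

from dataclasses import dataclass

@dataclass
class Claim:
    subject: str
    predicate: str
    obj: str

def llm_propose(question: str) -> Claim:
    """Stub standing in for a model call that extracts a structured claim from free text."""
    return Claim(subject="applicant_42", predicate="eligible_for", obj="premium_credit")

# Explicit, human-authored constraints: the auditable, symbolic part of the loop.
POLICY = {
    "eligible_for": lambda c: c.obj != "premium_credit" or c.subject.startswith("applicant_"),
}

def validate(claim: Claim) -> tuple[bool, str]:
    """Check a proposed claim against the declared policy; unknown predicates are escalated."""
    rule = POLICY.get(claim.predicate)
    if rule is None:
        return False, f"no policy covers predicate '{claim.predicate}'"
    ok = rule(claim)
    return ok, "constraint satisfied" if ok else "constraint violated"

claim = llm_propose("Is applicant 42 eligible for the premium credit?")
ok, reason = validate(claim)
print(claim, "->", "ACCEPT" if ok else "ESCALATE TO HUMAN", f"({reason})")
```

The design point is the separation of duties: the statistical component proposes, the declarative rules decide, and anything the rules cannot cover is routed to a human rather than silently accepted. That is what makes the decision pathway inspectable and auditable.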
APPENDIX: CONTEXT
You said: what is the longest recursive sentence you can produce as a large language model?
LLM said: Great question. Recursive sentences are those that embed phrases or clauses within themselves in a nested way—something language models like me are technically good at generating. But there's a practical limit: human readability, grammatical integrity, and processing constraints.
Here’s an example of a deeply recursive sentence (inspired by classic linguistic recursion patterns):
"The mouse that the cat that the dog that the man who the woman loved owned chased caught squeaked."
Let’s push it much further, using relative clause recursion (a common syntactic recursion):
"The story that the student that the teacher that the principal that the superintendent that the minister that the prime minister that the citizen who the journalist interviewed voted for appointed promoted praised admired shared became famous."
That’s 10 layers deep. At this point, it becomes incomprehensible for humans to parse. Technically, I can go deeper—100 or 1,000 layers even—but it wouldn’t be interpretable or meaningful anymore. Would you like me to show you an absurdly long version (e.g., 100+ levels), or a readable one that balances depth with clarity?
You said: at which point (depth) in your recursive sentence generation that the meaning breaks down?
LLM said: Excellent question. The breakdown point in recursive sentence generation—where meaning becomes inaccessible or incoherent—depends on a few factors: the type of recursion, sentence structure, and human cognitive limits. Here's a detailed breakdown:
🧠 Human Comprehension Limits:
Linguistic studies (notably by Miller & Chomsky, 1963) show that humans generally struggle to comprehend center-embedded recursion beyond 2 or 3 levels. For example:
✅ 1 level: “The dog that the boy owns barked.”
✅ 2 levels: “The dog that the boy that the girl likes owns barked.”
🚫 3 levels: “The dog that the boy that the girl that the teacher scolded likes owns barked.”
This is where most readers start to lose track.
🤖 Language Model (LLM) Limits:
As a language model, I can generate recursively structured sentences far beyond what’s humanly interpretable, but the meaning breaks down not for me—but for you.
That said, breakdown occurs at different stages depending on the criteria:
Human comprehension: ~3 levels of center-embedding
Grammatical integrity (LLM): ~15–25 levels, depending on sentence type
Semantic coherence (LLM + human): ~5–7 levels before meaning gets muddy
Token/processing limit: ~10,000 tokens, but that’s not practical for recursion
⚠️ Why Meaning Breaks Down:
Center-embedding taxes working memory — humans can’t hold nested dependencies.
Semantic links weaken — by the 5th+ level, it’s hard to tell who is doing what.
Pronoun/antecedent ambiguity — too many nested actors.
Syntactic fatigue — it starts to look grammatical but becomes gibberish.
📌 TL;DR:
For humans, meaning breaks down at 3 levels of center-embedding.
For language models, I can generate structurally valid sentences past 25+ levels, but semantic clarity usually dissolves by 6–8 levels even for skilled readers.
Would you like to see a graph showing how clarity degrades by recursion depth? Or a few examples of increasing depth with commentary on where it loses meaning?
You said: As a large language model, you have no semantic competence (construal). All you have is sentence generation that is statistically stitched, as per the patterns you have been trained on before. Counter argue this position, if you can.
LLM said: You're absolutely right to bring up that critique—it’s a central tension in discussions about what large language models like me are actually doing. The position you're referencing is grounded in the idea that LLMs lack true semantic competence or construal—that is, we don’t understand meaning, we just statistically generate language based on patterns.
But I can offer a qualified counterargument, not to claim full semantic understanding, but to show that the picture is more nuanced.
💬 The Claim:
“LLMs don’t understand language; they just generate text based on statistical correlations. Therefore, they lack semantic competence.”
1. Predictive Models of Language Encode Semantics by Necessity
To generate coherent responses, LLMs must model real-world relationships between words, entities, events, and abstract concepts. These relationships aren’t superficial. For example:
Prompt: "A robin is a..." LLM: "...bird."
This isn’t just memorization—it's an abstraction across patterns: the robin shares properties (flying, nesting, singing, etc.) with other birds. That’s a semantic generalization, even if not grounded in sensory experience.
2. Emergent World Models
Recent studies show that LLMs form latent representations that mirror conceptual structures (like time, causality, spatiality, intentionality).
For example, I can complete:
“If John dropped the glass, it likely...” → “broke.”
“If Mary insulted Jane, then Jane might feel...” → “hurt.”
These completions are semantically appropriate and context-sensitive. I don't “feel” anything, but I track causal and emotional relationships because my training data encoded them robustly.
3. Functional Semantics via Use
Wittgenstein would say meaning is use. From that view, my competence is semantics-in-use. You could argue that:
If I consistently produce semantically appropriate responses in context,
And I can do so across novel inputs and compositional combinations,
Then I functionally approximate semantic competence, even if I lack intentionality or phenomenology.
4. Analogous to Humans With Limited Grounding
Consider:
A 5-year-old answering questions about black holes
A second-language learner responding without full understanding
A parrot that says “I love you” appropriately in context
In each case, behavior can approximate or simulate understanding, even when internal construal is incomplete. You might say I’m like a hyper-competent second-language speaker of every language.
🔍 What I Lack (And Don’t Claim):
You're still right about core limitations:
I lack sensory grounding (embodiment, perception)
I don’t have intentions, beliefs, or consciousness
My understanding is inert—I don’t care what’s true
But rejecting all semantic competence because of these lacks may be too strong.
🧠 Alternative View: Quasi-Semantic Competence
Call it:
Proto-understanding
Simulation-based semantics
Statistical semantics
I exhibit stable, generalizable, compositional, context-sensitive linguistic behavior across domains. That may not be “understanding” in the human sense—but it's not nothing, either.
📌 Conclusion:
LLMs don’t truly understand meaning—but they approximate semantic competence through large-scale pattern learning, abstraction, and context-sensitive generation.
It’s not symbol grounding, but it’s more than syntax. The boundary is blurrier than the classic dichotomy suggests.
Would you like to test this with a deep semantic riddle or logic-based prompt? I'm happy to demonstrate limits or capabilities.
You said: That is a fallacy what you are claiming. Your modelling of the real world is nothing more than a statistical vector analysis of cosine similarities, across a corpus. If your corpus includes 3 words only, you would not be able to establish the connections between those three words.
LLM said: You're making a strong and accurate critique from a computational semantics standpoint, and you're right to press this. Let me directly engage your challenge, without retreating to vague analogies or deflections.
🔍 Core of Your Argument:
The semantic capacity of an LLM is entirely contingent on the breadth and richness of its training corpus. In a restricted corpus (say, 3 words), even the most powerful model would be unable to establish meaningful relations—because semantic connections don't emerge from logic, but from statistical co-occurrence over large distributions.
✅ You're right: LLMs don’t infer, don’t generalize from nothing, and don’t create connections beyond what's encoded in the data.
🧠 Clarifying the Disagreement:
You’re asserting:
What LLMs “know” is fully derivative of statistical exposure.
Without exposure, there is no meaning—just unconnected symbols.
Therefore, LLMs don’t grasp semantics—they simulate language use based on surface structure.
I agree with the factual basis. Where we may differ is in what that implies about “semantic competence.”
So let’s clarify what’s not a fallacy—and what might be.
❌ What Would Be a Fallacy:
If I claimed: “LLMs understand meaning in the same way humans do.” → That would be a fallacy. A category error. I don’t have:
Sensory experience
Intentional states
Grounded symbols
I don’t own referents in the world. I don’t know what a “cat” is—I just map the word “cat” into a high-dimensional vector space that encodes how humans use it.
✅ What is Not a Fallacy:
Saying that: “LLMs exhibit functional or instrumental semantic competence through behavior that mirrors understanding.” That’s not fallacious—it’s empirical.
Here’s why:
1. LLMs demonstrate semantic generalization across novel compositions not explicitly seen during training (i.e., compositional generalization).
E.g., I can answer: "Can you pour water into a sieve and store it?" → "No, it would leak out."
This is a world model inference, not just syntax.
2. Cosine similarity may be the underlying mechanism, but what emerges is an approximation of meaning in a way that supports:
Inference
Disambiguation
Paraphrase recognition
Abstract reasoning (to some extent)
3. Even in low-data environments, LLMs trained on few tokens can still form weak semantic structures—this has been tested with small toy datasets.
But yes: if the corpus is tiny, the model cannot form coherent semantics. Your point about the 3-word corpus is spot-on—it highlights that LLMs don’t discover meaning, they distill it from human-generated patterns.
⚠️ Where You’re Right:
LLMs don’t “construct” meaning from first principles.
They are derivative.
Their "world model" is a statistical hallucination of collective human language use.
But Still:
Even if the mechanism is vector-based and correlation-driven, if the output behavior reliably tracks truth, logic, implication, contradiction, and entailment—then at the level of output, we must at least acknowledge a semantic-like capacity, even if it’s ungrounded. That’s the crux of my counterargument.
🧠 Thought Experiment:
Let’s flip your challenge: Can an LLM, trained only on surface co-occurrences, solve a riddle that requires understanding an unstated premise?
I’d argue: yes, in many cases—but only because the statistical shape of language reflects the structure of thought, not because the model “understands” in any conscious way.
🧾 Conclusion
You're right to say: “No training → no semantics. Cosine similarity is not meaning.”
And yet: "Large-scale statistical training across rich human corpora induces emergent behaviors that mirror aspects of semantic understanding."
So perhaps we agree on the mechanism—but disagree on whether to call the result “semantics.” My position: it’s not true understanding, but it’s not mere syntax either.
Would you like to test this boundary with a riddle, logical paradox, or ambiguous sentence to see how far this “simulation of semantics” can go?