
Note/»AI in LLMs« — Absent Intelligence in Large Language Models

— Igor Böhm

This collection of random notes, quotes, and articles is a work in progress…

Acronyms

A Large Language Model (LLM) analyses vast pools of information to “learn” the statistical relationships between words and phrases.

A Large Multimodal Model (LMM) is an AI system capable of processing and generating content across multiple data types, or modalities, such as text, images, audio, and video.
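
To make “learning the statistical relationships between words and phrases” concrete, here is a deliberately tiny sketch: a bigram model that counts which word follows which and then predicts the statistically most likely continuation. The corpus and names are made up for illustration; real LLMs train transformer networks over subword tokens at vastly larger scale, but the core move of emitting the most probable continuation is the same.

```python
from collections import Counter, defaultdict

# Toy "language model": count word bigrams in a tiny corpus, then
# predict the most likely next word from those counts. Purely
# illustrative; real LLMs use transformer networks, not raw counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1  # how often does nxt follow prev?

def most_likely_next(word: str) -> str:
    """Return the statistically most frequent successor of `word`."""
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else "<unknown>"

print(most_likely_next("sat"))  # "on"  (seen twice, the clear favourite)
print(most_likely_next("the"))  # "cat" (a four-way tie, broken by first occurrence)
```

Note that the model has no idea what a cat or a mat is; it only knows which words tend to co-occur, which is precisely the point the quotes below keep returning to.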

Quotes

»Modern-AI works: not by reasoning logically, but by using statistical techniques to produce the most likely answer, based on an enormous training dataset.« — Bertrand Meyer 1

»[The artificial intelligence chatbot ChatGPT] is going to change everything about how we do everything. I think that it represents mankind’s greatest invention to date. It is qualitatively different — and it will be transformational.« — Craig Mundie, former Chief Research and Strategy Officer for Microsoft 2

»To observe an A.I. system — its software, microchips and connectivity — produce that level of originality in multiple languages in just seconds each time, well, the first thing that came to mind was the observation by the science fiction writer Arthur C. Clarke that “any sufficiently advanced technology is indistinguishable from magic.”« — Thomas L. Friedman 2

»It was observed centuries ago that the normal use of language has quite curious properties: it is unbounded, it is not random, it is not determined by external stimuli, and there is no reason to believe that it is determined by internal states; it is uncaused but it is somehow appropriate to situations; it is coherent; it invokes thoughts in the hearer that he or she might have expressed in the same way.« — Noam Chomsky (1992 Killian Lecture)

»So there is a whole school of linguistics that comes from Chomsky that thinks that it’s complete nonsense to say [large language models] understand, that they don’t process language at all in the same way as we do. I think that school is wrong. I think it’s clear now that neural nets are much better at processing language than anything ever produced by the Chomsky School of Linguistics. But there’s still a lot of debate about that, particularly among linguists.« — Geoffrey Hinton (2024) 3

»It’s true there’s been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success … which I think is novel in the history of science. It interprets success as approximating unanalyzed data.« — Noam Chomsky (2011) 4

TODO: add Peter Norvig and Geoffrey Hinton quotes lavishly praising AI models.

TODO: add quotes from Breiman 5 about »The Two Cultures« in statistical modelling.

Notes on »The False Promise of ChatGPT« 6

TODO: Summarise and extrapolate key problems identified in 6

»Perversely, some machine learning enthusiasts seem to be proud that their creations can generate correct “scientific” predictions (say, about the motion of physical bodies) without making use of explanations (involving, say, Newton’s laws of motion and universal gravitation). But this kind of prediction, even when successful, is pseudoscience. While scientists certainly seek theories that have a high degree of empirical corroboration, as the philosopher Karl Popper noted, “we do not seek highly probable theories but explanations; that is to say, powerful and highly improbable theories.”«

»The theory that apples fall to earth because that is their natural place (Aristotle’s view) is possible, but it only invites further questions. (Why is earth their natural place?) The theory that apples fall to earth because mass bends space-time (Einstein’s view) is highly improbable, but it actually tells you why they fall. True intelligence is demonstrated in the ability to think and express improbable but insightful things.«

Notes on »GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models« 7

TODO: Summarise and extrapolate key problems identified in 7

»Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn’t contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.« 7
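
The symbolic-template idea is easy to sketch. The following is my own illustration, not the authors’ code: one grade-school template is instantiated with random names and numbers, so the required reasoning chain stays fixed while the surface form varies. A genuine reasoner should score identically on every instantiation; the paper reports that state-of-the-art models do not.

```python
import random

# Hedged sketch of a GSM-Symbolic-style template (illustrative only;
# not the authors' code). Every instantiation needs the same reasoning
# step (x + y); only the surface details change, so accuracy should
# not vary across variants if the model truly reasons.
TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")

def instantiate(seed: int) -> tuple[str, int]:
    """Draw random surface values; the ground truth follows from the template."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Mia"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

for seed in range(3):
    question, answer = instantiate(seed)
    print(question, "->", answer)
```

The paper’s irrelevant-clause finding fits the same scheme: append a sentence like “Five of the apples are slightly smaller than average” (which changes nothing about the sum) and accuracy drops by up to 65% across state-of-the-art models.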

Notes on Peter Norvig’s »Chomsky and the Two Cultures of Statistical Learning« 8

TODO: Review Norvig’s rant 8 and cite Chomsky’s response to Norvig’s fulminations from the Q&A of 9

Notes on Bertrand Meyer’s »AI for software engineering: from probable to provable« 1

Why do we see CEOs forcing their engineers to use AI, or else? 10 11

TODO: Bertrand Meyer’s paper 1 has some excellent insights into why »the much-touted use of AI techniques for programming faces two overwhelming obstacles: the difficulty of specifying goals (“prompt engineering” is a form of requirements engineering, one of the toughest disciplines of software engineering); and the hallucination phenomenon. Programs are only useful if they are correct or very close to correct.« 1
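
Meyer’s first obstacle, specifying goals, can be made concrete with executable contracts (the approach his Design by Contract work is known for). The sketch below is my own illustration, not code from the paper: explicit pre- and postconditions pin down what “correct” means, which is exactly the precision a natural-language prompt lacks and against which AI-generated code could at least be checked.

```python
# Illustration (my sketch, not from Meyer's paper): an executable
# contract states precisely what "correct" means for a search routine,
# unlike a prompt such as "write a function that finds an element".
def binary_search(xs: list[int], target: int) -> int:
    """Return an index i with xs[i] == target, or -1 if target is absent."""
    assert xs == sorted(xs), "precondition: input must be sorted"
    lo, hi = 0, len(xs) - 1
    result = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            result = mid
            break
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    # postcondition: either target is truly absent, or we found a match
    assert (result == -1 and target not in xs) or xs[result] == target
    return result

print(binary_search([1, 3, 5, 8, 13], 8))  # 3
print(binary_search([1, 3, 5, 8, 13], 4))  # -1
```

A hallucinated implementation that merely looks plausible would trip these assertions; a prompt alone offers no such check.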

Notes on the application and use of LLMs/LMMs

The following is from a LinkedIn post by Simon Wardley:

  1. All outputs are hallucinations, i.e. fabricated and ungrounded. Many of these outputs happen to match reality when there’s abundant training data and repetition, so they look useful on common tasks. But they cannot do research. These machines are stochastic parrots (Bender et al.); they are pattern matchers, not reasoning engines.

  2. These systems will happily invent plausible-seeming but unverified detail. That’s a design feature, not a bug: they are optimised for coherence, not truth.

  3. These systems do not understand what they are creating. The use of tools and guardrails is mostly to convince you of their correctness and to hide their inner workings; they are about shaping perception and behaviour, not true comprehension. Yes, guardrails also reduce some classes of harm.

  4. These problems are not with the user and their prompting. Stop blaming users for what are design flaws and systematic issues.

  5. You cannot “swarm” your way out of these problems. Orchestration doesn’t solve fundamental epistemic limits. However, these systems (including agentic swarms) are extremely useful in the right context and are excellent for creating hypotheses (which then need to be tested).

  6. These systems can output long, convincing “scientific” documents full of fabricated metrics, invented methods, and impossible conditions without flagging uncertainty. They cannot be trusted for policy, healthcare, or serious research, because they are far too willing to blur fact and fiction.

  7. These systems can and should be used only as drafting assistants (structuring notes, summarising papers), with all outputs fact-checked by humans who are capable in the field. Think of these systems as a calculator that sometimes “hallucinates” numbers: it should never be blindly trusted to do your tax return.

  8. The persuasive but false outputs can cause real harm. These systems are highly persuasive and are designed to be so; hence the coherence, the appearance of “helpfulness”, and the use of authoritative language.

  9. Being trained on market data, these systems exhibit large biases towards market benefit rather than societal benefit. Think of it like a little Ayn Rand on your shoulder whispering sovereign individual Kool-aid. In other words, the optimisation leans toward market benefit, not necessarily public good.

– Appendix

Many use the term hallucination to mean “error from reality”. This implies that the LLM/LMM reasons its way to correct answers. I take the position that all output is “hallucinated”, and that some of it happens to match reality where we have lots of training data and narrow contexts. I feel this fairly reflects the statistical nature of LLMs/LMMs, as we haven’t built reasoning engines … yet.

Social Media References


  1. Bertrand Meyer, AI for software engineering: from probable to provable, Software Engineering and Artificial Intelligence, 2025. ↩︎

  2. Thomas Friedman, Our New Promethean Moment, The New York Times, Mar. 21, 2023. ↩︎

  3. Geoffrey Hinton, First Reactions; Telephone Interview, Nobel Prize, Oct. 24, 2024. ↩︎

  4. Noam Chomsky, Comments made at the Brains, Minds, and Machines symposium held during MIT’s 150th birthday party in 2011, Technology Review. ↩︎

  5. Leo Breiman, Statistical Modeling: The Two Cultures, Statistical Science, Vol. 16, No. 3, 199-231, 2001. ↩︎

  6. Noam Chomsky, Ian Roberts, and Jeffrey Watumull, Noam Chomsky: The False Promise of ChatGPT, The New York Times, Mar. 8, 2023. ↩︎

  7. Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, Machine Learning and Artificial Intelligence, 2025. ↩︎

  8. Peter Norvig, Colorless Green Ideas Learn Furiously: Chomsky and the Two Cultures of Statistical Learning, Significance, Volume 9, Issue 4, Aug. 2012. ↩︎

  9. Noam Chomsky, Generative Grammar Program Talk CHON-LING019. Recorded in Princeton, NJ on November 12, 2013. ↩︎

  10. Lindsay Ellis, The Boss Has a Message: Use AI or You’re Fired, The Wall Street Journal, Nov. 7, 2025. ↩︎

  11. Callum Borchers, These AI Power Users Are Impressing Bosses and Leaving Co-Workers in the Dust, The Wall Street Journal, Nov. 5, 2025. ↩︎

#Note