A big new paper from several AI researchers at Apple has been making the rounds. It concludes that large language models (LLMs) don’t do formal reasoning and can be easily distracted by minor irrelevant information. It’s a pretty dense 20+ page paper, but Gary Marcus has an excellent, in-depth summary on his site.
Marcus’ key takeaway from Apple’s AI research:
“We found no evidence of formal reasoning in language models… Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!”
Based on both his expertise and Apple’s findings, Marcus makes a definitive statement about LLMs’ reasoning capabilities:
There is just no way can you build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.
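To make that fragility concrete, here’s a rough sketch of the kind of perturbation test the paper describes: take one grade-school word problem, swap a name, add an irrelevant clause, and see whether the model’s answer changes. The problem text, the `extract_number` helper, the `openai` client wiring, and the model name are all illustrative assumptions on my part, not code from the paper.

```python
# Illustrative perturbation test: same arithmetic, superficially different wording.
# The client setup, model name, and problem text are assumptions for this sketch.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def ask_llm(prompt: str) -> str:
    """Send one prompt to the model under test and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whatever model you're evaluating
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

BASE = ("Sophie has 3 baskets with 12 apples each. "
        "She gives away 5 apples. How many apples does she have left?")

VARIANTS = [
    BASE,
    BASE.replace("Sophie", "Oliver"),  # same problem, different name
    BASE.replace("She gives away 5 apples.",
                 "Five of the apples are a bit smaller than average. "
                 "She gives away 5 apples."),  # same problem, plus an irrelevant clause
]

def extract_number(reply: str) -> str | None:
    """Crudely pull the last integer out of the reply as the 'final answer'."""
    numbers = re.findall(r"\d+", reply)
    return numbers[-1] if numbers else None

answers = [extract_number(ask_llm(v)) for v in VARIANTS]
print(answers)
if len(set(answers)) > 1:
    print("Fragile: superficially equivalent wordings produced different answers.")
```

The paper does this far more systematically, generating templated variants across a whole benchmark, but the basic idea is the same.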
These findings align with my experience. Getting high-quality, accurate responses from LLMs is possible, but it often requires careful prompting and iteration. LLMs are excellent tools that I use every day, and I’m actively helping build AI products at my day job, but like most tools they have their limitations. What’s particularly noteworthy to me is that these same limitations were documented back in 2019, and while LLMs have made remarkable progress in many areas, their fundamental reasoning capabilities haven’t improved at nearly the same pace.
So what does all this mean? Does it mean AI tools are dead? Not at all. I am a big proponent of human-in-the-loop AI solutions that leverage the strengths of AI and iteratively improve with human review and intervention. With human oversight, model monitoring, and great AI product designers (of course), we can build powerful AI tools that help us do the work we all need to get done every day, even if those tools can’t do formal reasoning.
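The human-in-the-loop pattern I’m describing doesn’t have to be elaborate. Here’s a rough sketch, reusing the hypothetical `ask_llm` helper from the earlier example: the model drafts, a person reviews, and nothing ships without sign-off.

```python
# Illustrative human-in-the-loop gate: the model drafts, a person approves or
# sends it back with feedback, and nothing is used downstream without sign-off.

def human_in_the_loop(task: str, ask_llm, max_rounds: int = 3) -> str | None:
    """Return a human-approved draft, or None if the human rejects or rounds run out."""
    draft = ask_llm(task)
    for _ in range(max_rounds):
        print(f"\n--- Draft ---\n{draft}\n")
        verdict = input("Approve (a), revise (r), or reject (x)? ").strip().lower()
        if verdict == "a":
            return draft   # human signed off
        if verdict == "x":
            return None    # human rejected outright
        feedback = input("What should change? ")
        draft = ask_llm(
            f"{task}\n\nRevise the draft below based on the feedback.\n"
            f"Draft:\n{draft}\n\nFeedback:\n{feedback}"
        )
    return None  # out of review rounds; escalate to a person rather than shipping
```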