LLMs Excel in Coding and Math but Struggle with Casual Questions, and That Is Not a Contradiction

Large language models (LLMs) have demonstrated remarkable prowess in specialized domains like coding and mathematics, consistently achieving top scores on rigorous benchmarks. Models such as OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet dominate leaderboards for tasks like HumanEval, a coding benchmark that evaluates the ability to complete programming problems, and GSM8K, a dataset of grade-school math word problems. These achievements have fueled excitement about AI’s potential to revolutionize software development and scientific computation. Yet the same models often falter on seemingly simple, casual questions that require everyday common sense or real-world knowledge. This apparent paradox is not a flaw in the models but a reflection of fundamental differences in how these tasks are constructed and how LLMs are trained.

Consider the performance metrics. On LiveCodeBench, a challenging coding evaluation that includes recent problems to minimize data contamination, GPT-4o scores an impressive 72.0 percent, while Claude 3.5 Sonnet reaches 75.8 percent. In mathematics, both models solve over 95 percent of GSM8K problems correctly. These results stem from the structured nature of the tasks. Coding benchmarks like HumanEval present partial functions with clear specifications, test cases, and unambiguous success criteria. A correct solution either passes all tests or it does not. Similarly, math problems in GSM8K follow logical steps with precise arithmetic and definitional rules, allowing models to mimic patterns encountered frequently during training.
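To make that contrast concrete, here is a minimal sketch of what a HumanEval-style item looks like. The function, its docstring, and the tests are invented for illustration; they are not drawn from the actual benchmark.

```python
# A HumanEval-style item (invented for illustration, not an actual
# benchmark problem): the model sees a stub plus docstring and must
# complete the function body.
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards,
    ignoring case."""
    normalized = text.lower()
    return normalized == normalized[::-1]  # one completion the grader accepts

# Grading is binary: the completion either passes every test or it fails.
def check(candidate):
    assert candidate("Level") is True
    assert candidate("coffee") is False
    assert candidate("") is True

check(is_palindrome)
print("all tests passed")
```

That pass/fail check is exactly what makes coding benchmarks cheap to grade at scale.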

In contrast, casual questions probe nuanced, context-dependent knowledge that defies such rigidity. For instance, when asked “What should I do if I spill hot coffee on my laptop?”, top models provide advice ranging from the sensible (“Unplug it immediately and wipe with a dry cloth”) to the bizarre (“Put it in rice” or “Turn it upside down and shake”). Even basic queries like “How many legs does a dog have?” elicit occasional errors, such as claims of three or five legs. On the SimpleQA benchmark, which tests straightforward factual recall without tricks, GPT-4o answers only 58 percent correctly, Claude 3.5 Sonnet fares slightly better at 62 percent, and even Gemini 1.5 Pro lags at 55 percent. These rates are shockingly low for questions any human would answer flawlessly.

The root cause lies in the training data. LLMs are pretrained on vast internet corpora, where coding snippets, math solutions, and textbook explanations abound. GitHub repositories, Stack Overflow threads, and academic papers provide millions of examples with correct answers explicitly labeled or inferable through test outcomes. This abundance enables models to internalize reliable patterns for these domains. Casual questions, however, draw from diverse, noisy real-world sources like forums, blogs, and conversations, where answers vary by context, opinion, or error. A spilled coffee thread might include myths like the rice remedy alongside valid tips, diluting the signal for correct responses.

Moreover, evaluation methods highlight the disparity. Coding and math use automated grading: execute the code or compute the result, and verify against ground truth. Casual questions rely on human-annotated correctness, which introduces subjectivity. What counts as a “good” answer to “Should I wear a jacket today?” depends on location, weather, and personal tolerance, yet benchmarks reduce it to a binary right or wrong based on consensus facts.
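A rough sketch shows why grading is trivial in one domain and fraught in the other. The extraction rule below (take the last integer in the model’s output) is a simplifying assumption, not the exact pipeline any benchmark uses.

```python
import re

# Sketch of automated grading for a GSM8K-style word problem. Answer
# extraction here is deliberately naive: grab the last integer and
# compare it to the ground truth.
def grade_math_answer(model_output: str, expected: int) -> bool:
    numbers = re.findall(r"-?\d+", model_output.replace(",", ""))
    return bool(numbers) and int(numbers[-1]) == expected

print(grade_math_answer("She pays 3 * 4 = 12 dollars in total.", 12))  # True

# No equivalent verifier exists for "Should I wear a jacket today?":
# correctness depends on context the grader cannot execute.
```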

This is not merely a data issue; it underscores LLMs’ reliance on statistical next-token prediction rather than genuine understanding. In coding, predicting the next line often aligns with functional correctness due to repetitive structures. Math benefits from chain-of-thought prompting, where step-by-step reasoning mirrors training examples. Casual queries demand holistic reasoning, integrating sparse facts without clear paths, exposing limitations in world modeling.
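The effect of chain-of-thought prompting is easiest to see side by side. The word problem below is invented, and the step-by-step phrasing is just one common pattern, not any specific benchmark’s template.

```python
# Illustrative chain-of-thought exemplar. The word problem is made up;
# "Let's think step by step" is a widely used CoT cue.
question = (
    "A baker sells 12 muffins in the morning and twice as many in the "
    "afternoon. How many does she sell in total?"
)

direct_prompt = f"{question}\nAnswer:"

cot_exemplar = (
    f"{question}\n"
    "Let's think step by step.\n"
    "Morning sales: 12 muffins.\n"
    "Afternoon sales: 2 * 12 = 24 muffins.\n"
    "Total: 12 + 24 = 36 muffins.\n"
    "Answer: 36"
)

# Prepending exemplars like cot_exemplar to a new question nudges the model
# to emit intermediate steps that track the solution patterns it saw in
# training, whereas direct_prompt asks it to guess the answer in one hop.
print(cot_exemplar)
```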

Recent studies reinforce this view. On the ARC-Challenge, which tests abstract reasoning, LLMs score below 50 percent despite billions of parameters. Conversely, specialized fine-tuning boosts coding and math performance dramatically, as seen in models like DeepSeek-Coder-V2, which tops coding charts through targeted training.

Implications for users are clear: leverage LLMs for verifiable, structured tasks where outputs can be checked programmatically. For casual advice, treat responses skeptically and cross-verify. Developers should prioritize benchmarks matching real use cases and invest in retrieval-augmented generation for factual grounding.
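As an illustration of the retrieval-augmented idea, here is a deliberately naive sketch. The corpus, the word-overlap scoring, and the prompt template are all placeholders; a production system would use a vector index and an actual model API.

```python
# Minimal retrieval-augmented generation (RAG) sketch: ground the model's
# answer in retrieved text instead of its parametric memory.
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query; return the top k."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus.values(),
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def grounded_prompt(query: str, corpus: dict[str, str]) -> str:
    """Assemble a prompt that tells the model to answer from the context."""
    context = "\n".join(retrieve(query, corpus))
    return (
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

corpus = {
    "doc1": "Liquid spill on a laptop: power off and unplug immediately.",
    "doc2": "Rice does not reliably dry electronics; air drying works better.",
}
print(grounded_prompt("What should I do if I spill coffee on my laptop?", corpus))
```

Grounding the model in retrieved documents shifts the failure mode from confident fabrication toward verifiable sourcing, which is exactly what casual factual questions lack.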

This performance gap demystifies LLM capabilities. Excellence in coding and math reflects data richness and evaluability, not superior intelligence. Struggles with casual questions reveal reliance on patterns over comprehension. Far from contradictory, these traits illuminate the path forward: curate better data, refine evaluations, and integrate tools for robust AI.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since integrating AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI runs fully offline, so no data ever leaves your computer. Based on Debian Linux, Gnoppix ships with numerous privacy- and anonymity-focused services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.