Gemini 3 Pro and GPT-5 still fail at complex physics tasks designed for real scientific research

Large language models (LLMs) have made remarkable strides in natural language processing, coding, and even basic reasoning tasks. However, when confronted with the rigorous demands of real-world scientific research—particularly in physics—cutting-edge models like Google’s Gemini 3 Pro and OpenAI’s GPT-5 previews reveal significant limitations. A recent evaluation using specialized physics benchmarks underscores that these systems still struggle to perform at the level required for graduate-level or professional scientific inquiry.

The assessment draws from established benchmarks such as GPQA (Graduate-Level Google-Proof Q&A), which features questions crafted by domain experts to probe deep understanding in physics, chemistry, and biology. These are not rote memorization tests; they demand novel reasoning, integration of multiple concepts, and handling of edge cases that mimic authentic research challenges. GPQA’s physics subset, in particular, includes problems involving quantum mechanics, general relativity, statistical mechanics, and condensed matter physics—areas where superficial pattern matching falls apart.

Gemini 3 Pro, Google’s latest flagship model with enhanced multimodal capabilities and a massive context window, was put to the test alongside GPT-5 preview candidates, including reasoning-focused variants descended from the o1 line. Evaluators prompted the models with raw problem statements, without additional scaffolding or chain-of-thought hints beyond standard configurations. The results were telling: Gemini 3 Pro achieved accuracy rates hovering around 40-50% on GPQA physics questions, while GPT-5 previews fared marginally better at 45-55%, depending on the specific variant and temperature settings.

To contextualize these scores, human baselines provide stark contrast. PhD-level physicists score above 70% on GPQA, with experts in the relevant subfields often exceeding 80%. Undergraduate physics majors average around 35%, so while today’s top LLMs edge past novice performance, they remain far below expert levels in this domain. The failures are not random; they cluster around tasks requiring multi-step derivations, symmetry arguments, or approximations grounded in physical intuition.

Consider a typical failure mode observed in the evaluation. A problem on quantum field theory might ask for the computation of a scattering amplitude in a specific gauge, demanding renormalization techniques and Feynman diagram evaluation. Gemini 3 Pro often generates plausible-looking expressions but errs in dimensional analysis or overlooks subtle divergences. GPT-5, leveraging its internal reasoning chains, sometimes identifies the correct approach but hallucinates intermediate steps, such as misapplying Wick’s theorem or confusing axial with vector currents. These lapses stem from the models’ reliance on statistical correlations in training data rather than verifiable physical principles.
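
The dimensional-analysis lapses mentioned above are exactly the kind of error a symbolic engine catches mechanically. As a minimal illustration (the formula and check are my own example, not taken from the evaluation), the sketch below uses SymPy's unit system to confirm that the Compton wavelength expression h / (m c) really reduces to a length:

```python
# Hedged sketch: machine-checking the dimensional analysis that the
# article notes LLMs often fumble. The Compton wavelength h / (m c)
# must carry dimensions of length; convert_to exposes leftover units
# if the expression were dimensionally inconsistent.
from sympy.physics.units import convert_to, kilogram, meter, planck, speed_of_light

mass = kilogram  # 1 kg stand-in; only the dimension matters here
expr = planck / (mass * speed_of_light)

as_length = convert_to(expr, meter)
# If expr is dimensionally a length, dividing by meter leaves a pure
# number; any stray unit would make this assertion fail.
assert (as_length / meter).is_number
print(as_length)
```

A model-generated derivation could be run through the same kind of check before any of its intermediate expressions are trusted.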

Another benchmark highlighted in the analysis is the SciBench suite, tailored for scientific reasoning in physics. It includes tasks like deriving equations of state for exotic matter phases or predicting outcomes in non-equilibrium thermodynamics. Here, Gemini 3 Pro’s performance dipped below 30% for open-ended derivations, while GPT-5 managed 35-40%. The models excelled at recall-heavy questions, such as restating the Schwarzschild metric, but crumbled when perturbations or novel parameter regimes were introduced.

Visualizations from the study plot performance against model size and training compute. Despite exponential scaling in parameters (Gemini 3 Pro is rumored to weigh in at trillions), diminishing returns are evident. Post-training alignments, including safety fine-tuning and reinforcement learning from human feedback (RLHF), appear to exacerbate issues in technical domains by prioritizing fluency over precision.

The evaluation also tested tool-use integrations, such as pairing models with symbolic solvers like SymPy or numerical simulators. Even augmented, success rates topped out at 60%, insufficient for research-grade reliability. For instance, in a general relativity task involving geodesic equations on a Kerr black hole, models correctly invoked the metric but failed to integrate conserved quantities properly, leading to divergent orbits.
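
The conserved-quantity check the evaluators applied can be illustrated on a much simpler system. The sketch below is a toy Newtonian Kepler orbit, not the Kerr geodesic task from the benchmark: integrate the equations of motion, then verify that energy and angular momentum drift only at the solver's tolerance level. Divergent drift is the signature of the integration errors the models made.

```python
# Toy analogue (Newtonian two-body problem, my own example) of
# verifying conserved quantities along a numerically integrated orbit.
import numpy as np
from scipy.integrate import solve_ivp

GM = 1.0  # gravitational parameter in natural units

def rhs(t, s):
    """State s = (x, y, vx, vy); inverse-square central force."""
    x, y, vx, vy = s
    r3 = (x * x + y * y) ** 1.5
    return [vx, vy, -GM * x / r3, -GM * y / r3]

s0 = [1.0, 0.0, 0.0, 1.1]  # mildly eccentric bound orbit
sol = solve_ivp(rhs, (0.0, 50.0), s0, rtol=1e-10, atol=1e-12)

x, y, vx, vy = sol.y
energy = 0.5 * (vx**2 + vy**2) - GM / np.hypot(x, y)   # specific energy
ang_mom = x * vy - y * vx                              # specific angular momentum

# Both quantities should be constant up to integration tolerance.
print(np.ptp(energy), np.ptp(ang_mom))
```

In the relativistic case the same pattern applies with the Kerr constants of motion (energy, axial angular momentum, Carter constant) substituted for the Newtonian ones.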

These shortcomings have profound implications for scientific workflows. Researchers increasingly experiment with LLMs for hypothesis generation, literature synthesis, or code debugging in simulations. Yet, the benchmarks reveal over-reliance risks: erroneous derivations could propagate through papers or experiments, wasting resources. The study advocates hybrid approaches, where LLMs handle rote tasks but humans oversee critical reasoning.

Comparisons with prior models paint a trajectory of slow progress. GPT-4 scored ~30% on GPQA physics, Claude 3.5 Sonnet ~42%, and Llama 3.1 405B ~38%. Gemini 3 Pro and GPT-5 represent incremental gains, but the gap to human expertise persists. Concerns about data contamination, where training corpora include benchmark-adjacent solutions, are mitigated by GPQA’s Google-proof construction, pointing to genuine capability deficits rather than mere memorization gaps.

Training paradigms may need rethinking. Current methods emphasize next-token prediction, excelling at interpolation but faltering on extrapolation central to physics discovery. Proposals include physics-specific pretraining on arXiv derivations, synthetic data from symbolic engines, or test-time compute scaling via verifiers. Nonetheless, the evaluation cautions against hype: LLMs are powerful assistants, not autonomous researchers.
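
The test-time-verifier idea amounts to a generate-and-check loop: sample several candidate answers and keep only those a symbolic engine confirms. The sketch below is a minimal illustration with a made-up candidate list (hypothetical model outputs for the derivative of x·sin x), not the study's actual pipeline:

```python
# Minimal generate-and-verify sketch. The `candidates` list stands in
# for sampled model outputs; SymPy acts as the verifier.
import sympy as sp

x = sp.symbols("x")
problem = x * sp.sin(x)          # task: differentiate with respect to x
candidates = [
    sp.cos(x),                   # wrong: dropped the sin(x) term
    sp.sin(x) + x * sp.cos(x),   # correct product rule
    x * sp.cos(x),               # wrong: dropped the sin(x) term
]

truth = sp.diff(problem, x)
# Keep only candidates the verifier can reduce to the true derivative.
verified = [c for c in candidates if sp.simplify(c - truth) == 0]
print(verified)  # only the product-rule answer survives
```

Real verifier pipelines replace the ground-truth comparison with independent checks (dimensional consistency, limiting cases, numerical spot tests), since the true answer is not available at inference time.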

In summary, while Gemini 3 Pro and GPT-5 push boundaries in breadth, depth in complex physics remains elusive. True scientific AI will demand innovations beyond scale, toward verifiable reasoning and physical grounding.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.