AI Reasoning Models Conquer All Three Levels of the CFA Exam

Large language models equipped with advanced reasoning capabilities have achieved a groundbreaking milestone: passing all three levels of the Chartered Financial Analyst (CFA) examination with flying colors. Models such as OpenAI’s o1 series, Google’s Gemini 2.0 Flash Thinking Experimental, and Anthropic’s Claude 3.5 Sonnet with extended thinking have demonstrated exceptional performance across the rigorous CFA curriculum, which tests quantitative methods, economics, financial reporting, corporate finance, equity, fixed income, derivatives, alternative investments, portfolio management, and wealth planning.

The CFA program, administered by the CFA Institute, is renowned as one of the most challenging professional certifications in finance. It comprises three sequential levels, each building on the previous one with increasing complexity:

  • Level I: Focuses on foundational knowledge through 180 multiple-choice questions, emphasizing tools and inputs for investment analysis.
  • Level II: Delves into asset valuation via item-set questions, requiring analysis of complex scenarios.
  • Level III: Demands synthesis and evaluation through constructed response (essay-style) questions, testing the ability to formulate and justify investment recommendations.

Historically, AI models have faltered on these exams, particularly at higher levels. Earlier benchmarks, such as those from 2023, showed GPT-4 achieving only 48% on Level I, Claude 2 at 76%, and PaLM 2 at 67%. Performance dropped sharply at Levels II and III, where nuanced reasoning and essay construction proved insurmountable. Even GPT-4o, a more recent model, scored 67% on Level I, 61% on Level II, and 63% on Level III.

This landscape shifted dramatically with the advent of reasoning models, which employ chain-of-thought prompting and iterative thinking processes to mimic human problem-solving. These models generate intermediate reasoning steps before arriving at final answers, enabling deeper analysis of multifaceted problems.
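To make the idea concrete, here is a minimal sketch of chain-of-thought prompting on a CFA Level I-style time-value-of-money question. The question, prompt template, and helper function are illustrative assumptions, not taken from the benchmark; the helper only computes the reference answer the model's final step should converge on.

```python
# Illustrative chain-of-thought prompt for a CFA Level I-style question.
# The question and template are hypothetical examples, not benchmark items.

QUESTION = (
    "An investor deposits $10,000 in an account paying 6% annual interest, "
    "compounded monthly. What is the balance after 5 years?"
)

COT_PROMPT = (
    "Answer the question below. Think step by step: identify the formula, "
    "substitute the inputs, compute intermediate values, then state the "
    "final answer.\n\nQuestion: " + QUESTION
)

def future_value(pv: float, annual_rate: float,
                 years: int, periods_per_year: int) -> float:
    """Reference answer: FV = PV * (1 + r/m) ** (m * n)."""
    m = periods_per_year
    return pv * (1 + annual_rate / m) ** (m * years)

# Ground-truth value an evaluator would compare the model's answer against.
fv = future_value(10_000, 0.06, 5, 12)
print(f"{fv:.2f}")
```

The point of the "think step by step" instruction is that the model emits the formula and intermediate values before the final number, which is where reasoning models spend their extra compute.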

Benchmark Results: Top Performers Across All Levels

A comprehensive evaluation conducted by Valeriia Chernova, shared on X (formerly Twitter), tested leading reasoning models on official CFA practice exams from 2024. The results are staggering:

| Model | Level I | Level II | Level III |
|---|---|---|---|
| OpenAI o1 | 96% | 98% | 94% |
| OpenAI o1-preview | 100% | 100% | 100% |
| OpenAI o1-mini (high) | 100% | 98% | 100% |
| OpenAI o3-mini (high) | 100% | 100% | 100% |
| OpenAI o3-mini (medium) | 100% | 100% | 94% |
| Google Gemini 2.0 Flash Thinking Experimental | 93% | 100% | 94% |
| Anthropic Claude 3.5 Sonnet (extended thinking) | 85% | 100% | 88% |

OpenAI’s o1-preview and o1-mini (high) configurations achieved perfect or near-perfect scores across the board, effectively “acing” the exams. Gemini 2.0 Flash Thinking Experimental excelled on Level II with 100%, while Claude 3.5 Sonnet, using a custom extended thinking mode, also cleared all levels despite slightly lower Level I and Level III results.

The testing methodology was rigorous. Exams were sourced directly from CFA Institute materials and administered via APIs without external tools or retrieval-augmented generation. For Level III essays, models generated full responses that were scored against official rubrics. Time limits mirrored the real exams: 4.5 hours for Level I, 4.5 hours across two sessions for Level II, and 4.5 hours for Level III. The CFA Institute does not publish its minimum passing scores, but they are commonly estimated at roughly 70%, a bar every tested reasoning model cleared on all three levels.
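A hypothetical sketch of how the multiple-choice portion of such an evaluation might be scored; the function names and the ~70% threshold are assumptions for illustration (the actual minimum passing score is not published), not the evaluators' code.

```python
# Hypothetical scoring harness for a multiple-choice exam level.
# The ~70% threshold is an assumed estimate; the CFA Institute does not
# publish its actual minimum passing score.
PASS_THRESHOLD = 0.70

def score_mcq(predictions: list, answer_key: list) -> float:
    """Fraction of multiple-choice answers matching the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

def passed(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    """True if the percentage score clears the assumed passing bar."""
    return score >= threshold

# Toy example: 3 of 4 answers match the key.
print(score_mcq(["A", "B", "C", "A"], ["A", "B", "C", "B"]))
```

Essay levels cannot be scored by exact match like this, which is why the evaluation described above graded Level III responses against official rubrics instead.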

Key Factors Behind the Success

Several innovations underpin this leap in performance:

  1. Chain-of-Thought Reasoning: Models like o1 internally simulate step-by-step deliberation, reducing errors in quantitative calculations and qualitative judgments.
  2. Extended Context Windows: Handling the voluminous CFA curriculum requires processing extensive prompts without truncation.
  3. Iterative Refinement: Features such as o1’s “thinking tokens” and Claude’s extended thinking allow multiple passes over problems, akin to human double-checking.
  4. Generalizable Reasoning Training: Although the models were used off the shelf without finance-specific fine-tuning, their reasoning enhancements transfer to finance-specific challenges like discounted cash flow models, option pricing, and portfolio optimization.

Notably, non-reasoning models lagged: GPT-4o scored below passing on Level II (61%), and Gemini 1.5 Pro managed only 79% on Level I.

Implications for Finance and AI

This achievement signals that reasoning models are maturing into tools capable of professional-grade financial analysis. They can now tackle exams requiring not just recall but synthesis, valuation, and recommendation formulation—skills central to CFA charterholders managing trillions in assets.

For the finance industry, implications abound. AI could automate routine analysis, freeing analysts for strategic work. However, limitations persist: models may hallucinate in edge cases, lack real-time data, and cannot replace ethical judgment or client interaction. Regulatory bodies like the CFA Institute emphasize human oversight.

From an AI research perspective, these results validate scaling reasoning compute. OpenAI’s o1 family, trained with reinforcement learning on synthetic reasoning data, exemplifies how “test-time compute” boosts capability without proportional training costs.

As benchmarks evolve, future tests may incorporate live market data or adversarial scenarios. Yet, for now, reasoning models have unequivocally mastered the CFA triad, marking a pivotal moment in AI’s encroachment on expert domains.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.