Phi-4-reasoning is a 14-billion parameter model specialized in complex reasoning tasks. It is trained using supervised finetuning (SFT) on diverse prompts and reasoning demonstrations from o3-mini. The model generates detailed reasoning chains and leverages inference-time compute effectively. Phi-4-reasoning-plus, an enhanced version with reinforcement learning (RL), delivers even higher performance by generating longer reasoning traces.
Despite their smaller size (14B parameters), Phi-4-reasoning and Phi-4-reasoning-plus are competitive with or exceed much larger open-weight reasoning models (QwQ-32B, DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1) and closed ones (o1-mini, Claude 3.7 Sonnet) across several benchmarks, as shown in Figures 1 and 3 and Tables 1 and 2. Our extensive benchmarks span math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we also observe a non-trivial transfer of improvements to general-purpose benchmarks.
Figure 1. Performance comparison on representative reasoning benchmarks spanning mathematics (HMMT, AIME 25, OmniMath), scientific (GPQA), and coding (LiveCodeBench 8/24-1/25) domains.
Notably, Phi-4-reasoning and Phi-4-reasoning-plus achieve better performance than o1-mini and DeepSeek-R1-Distill-Llama-70B on most benchmarks and achieve performance comparable to the full DeepSeek-R1 model (with 671B parameters) on AIME 2025 (the 2025 qualifier for the USA Math Olympiad). They also outperform Claude 3.7 Sonnet and Gemini 2 Flash Thinking on all tasks except GPQA (PhD-level STEM questions) and Calendar Planning.
More Potential with Parallel Test-time Scaling: As shown in Figure 2, our small-ish model nearly saturates performance on AIME 2025 with increasing parallel test-time compute (e.g., Majority@N), surpassing the pass@1 accuracy of its teacher model, o3-mini.

Figure 2. Effects of parallel test-time compute on AIME 2025.
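To make the Majority@N metric concrete, below is a minimal sketch (our illustration, not the evaluation harness behind Figure 2) of how N independently sampled answers to a single problem are aggregated by majority vote, and how that differs from pass@1. The helper names majority_at_n and pass_at_1, and the toy answers, are hypothetical.

```python
from collections import Counter

def majority_at_n(sampled_answers, reference):
    """Majority@N: among N independently sampled final answers to one problem,
    score only the single most frequent answer against the reference."""
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return float(most_common_answer == reference)

def pass_at_1(sampled_answers, reference):
    """Pass@1 estimated from the same N samples: the fraction of individual
    generations that are correct, i.e., the expected single-sample accuracy."""
    return sum(answer == reference for answer in sampled_answers) / len(sampled_answers)

# Toy example: five sampled answers to one AIME-style problem (reference "204").
answers = ["204", "204", "17", "204", "896"]
print(majority_at_n(answers, "204"))  # 1.0 -- majority vote recovers the answer
print(pass_at_1(answers, "204"))      # 0.6 -- accuracy of a single sample
```

This is why accuracy can keep improving with more parallel samples: majority voting filters out mistakes that individual generations make inconsistently.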

Table 1. Average Pass@1 accuracy on selected reasoning benchmarks. Bold denotes best model per benchmark and model class (open versus closed-weight), and underline denotes the second best. We report the standard deviation in parentheses where available.
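As a minimal sketch of how the entries in Table 1 are reported, assuming hypothetical per-run accuracies: the mean pass@1 across independent runs, with the sample standard deviation shown in parentheses.

```python
import statistics

def summarize_pass_at_1(per_run_accuracies):
    """Mean pass@1 across independent evaluation runs, plus the sample
    standard deviation (the value reported in parentheses)."""
    mean = statistics.mean(per_run_accuracies)
    std = statistics.stdev(per_run_accuracies) if len(per_run_accuracies) > 1 else 0.0
    return mean, std

# Hypothetical accuracies from five independent runs of one benchmark.
runs = [0.78, 0.81, 0.74, 0.80, 0.77]
mean, std = summarize_pass_at_1(runs)
print(f"{100 * mean:.1f} ({100 * std:.1f})")  # prints "78.0 (2.7)"
```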
Key contributors to best-in-class performance
Below we summarize the core contributions that led to the superior performance of the Phi-4-reasoning models. More comprehensive technical details and experiments for each point are provided in our technical report [1].
- Careful Data Curation: our reasoning prompts are specifically filtered to cover a range of difficulty levels and to lie at the boundary of the base model capabilities. Our approach aligns closely with data-centric methods of earlier Phi and Orca models [2,3,4,5,6,7,8], demonstrating that meticulous data curation and high-quality synthetic datasets allow smaller models to compete with larger counterparts. The datasets used in supervised finetuning include topics in STEM (science, technology, engineering, and mathematics), coding, and safety-focused tasks. Our reinforcement learning is conducted on a small set of high-quality math-focused problems with verifiable solutions.
- Benefits of Supervised Finetuning (SFT): Phi-4-reasoning already performs strongly across diverse benchmarks after the SFT stage alone. Interestingly, the improvement in performance generalizes to tasks not directly targeted in the training data, such as calendar planning and general-purpose benchmarks (Table 2). We highlight the critical role of the data mixture and training recipe in unlocking reasoning capabilities during the SFT stage, which goes hand in hand with our data selection and filtering.
- Boost with Reinforcement Learning: we are encouraged by the gains achieved through a short round of outcome-based reinforcement learning (RL) and the potential of combining distillation/SFT and reinforcement learning. We observe that the model after RL provides higher accuracy on math while using approximately 1.5x more tokens than the SFT model on average, offering a trade-off between accuracy and inference-time compute.
We think that reasoning is a transferable meta-skill that can be learned through supervised finetuning alone and further enhanced with reinforcement learning. To test the generalization of the models’ reasoning capabilities, we evaluate them on multiple new reasoning benchmarks that require algorithmic problem solving and planning, including 3SAT (3-literal Satisfiability Problem), TSP (Traveling Salesman Problem), and BA-Calendar planning. These reasoning tasks are nominally out-of-domain for the models, as the training process did not target these skills, yet the models show strong generalization to them, as shown in Figure 3.

Table 2. Average pass@1 accuracy on general-purpose benchmarks, averaged across five runs.
This generalized improvement in capabilities also goes beyond reasoning. Without explicit training on non-reasoning tasks, we see significant improvements on IFEval, FlenQA, and our internal PhiBench, as shown in Table 2. And despite limited coding data during the SFT stage (and none during RL), the model performs well on coding, scoring at the o1-mini level on LiveCodeBench (LCB) and Codeforces, as shown in Table 1. We plan to emphasize coding further in future versions.

Figure 3. Average Pass@1 performance on reasoning benchmarks, averaged across five runs. Except for GPQA, these benchmarks are out-of-distribution with respect to Phi-4-reasoning’s training data.
Lessons on Evaluating Reasoning Models
Language models exhibit large generation nondeterminism, i.e., they may produce substantially different answers given the same prompt and inference hyperparameters (e.g., temperature). To account for this stochastic behavior, we study the accuracy distribution on AIME 2025, approximated by kernel density estimation over 50 independent runs with the same prompt and temperature. Several interesting observations emerge, as illustrated in Figure 4:
- All models show high accuracy variance. For example, the accuracy of answers generated by DeepSeek-R1-Distill-Llama-70B ranges from 30% to 70%, while o3-mini’s accuracy ranges from 70% to 100%. This suggests that any comparison among models based on a single run can easily produce misleading conclusions.
- Models at the two extremes of average accuracy show more stable accuracy. For example, Phi-4-reasoning-plus and Phi-4 have relatively narrow accuracy ranges compared to DeepSeek-R1-Distill-Llama-70B and Phi-4-reasoning.
- The accuracy distribution further underscores the competitive performance of Phi-4-reasoning-plus: it largely overlaps with o3-mini’s distribution and is almost disjoint from DeepSeek-R1-Distill-Llama-70B’s distribution.

Figure 4. Distribution of pass@1 accuracy on AIME 2025, approximated by kernel density estimation over 50 runs with the same prompt and temperature. The distribution for Phi-4-reasoning-plus largely overlaps with o3-mini’s and is almost disjoint from DeepSeek-R1-Distill-Llama-70B’s.
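The estimation behind Figure 4 can be sketched in a few lines of SciPy. The per-run accuracies below are synthetic placeholders (not the reported numbers); only the kernel density estimation step mirrors the procedure described above.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for 50 per-run pass@1 accuracies on AIME 2025,
# all obtained with the same prompt and temperature (values are illustrative).
rng = np.random.default_rng(0)
per_run_accuracy = rng.normal(loc=0.78, scale=0.05, size=50).clip(0.0, 1.0)

# Kernel density estimate of the accuracy distribution, as plotted in Figure 4.
kde = gaussian_kde(per_run_accuracy)
grid = np.linspace(0.0, 1.0, 201)
density = kde(grid)  # density values to plot against `grid`

# The spread below is exactly what a single-run comparison between models hides.
print(f"min={per_run_accuracy.min():.2f}  "
      f"max={per_run_accuracy.max():.2f}  "
      f"mean={per_run_accuracy.mean():.2f}")
```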
Phi-4-Reasoning in action
Below we provide some interesting example responses from Phi-4-reasoning that showcase its intelligent behavior.

Prompt: “Generate a website for steves pc repairs using a single html script”
Prompt: “write a Python program that shows a ball bouncing inside a spinning triangle. The ball must bounce off the rotating walls realistically and should not leave the triangle”
References
[1] “Phi-4-reasoning Technical Report.” arXiv preprint arXiv:2504.21318 (2025).
[2] “Phi-4 technical report.” arXiv preprint arXiv:2412.08905 (2024).
[3] “Phi-3 technical report: A highly capable language model locally on your phone.” arXiv preprint arXiv:2404.14219 (2024).
[4] “Phi-2: The surprising power of small language models.” Microsoft Research Blog (2023).
[5] “Textbooks are all you need.” arXiv preprint arXiv:2306.11644 (2023).
[6] “AgentInstruct: Toward generative teaching with agentic flows.” arXiv preprint arXiv:2407.03502 (2024).
[7] “Orca 2: Teaching small language models how to reason.” arXiv preprint arXiv:2311.11045 (2023).
[8] “Orca: Progressive learning from complex explanation traces of GPT-4.” arXiv preprint arXiv:2306.02707 (2023).