# gpt-oss-120b-sft-aimo3-FishMath🐟
Supervised fine-tuned checkpoint of openai/gpt-oss-120b for olympiad-style mathematical reasoning.
This model was developed as part of our work for Kaggle AI Mathematical Olympiad - Progress Prize 3. It is intended for integer-answer contest math and can be used with plain text reasoning or tool-integrated reasoning using a Python execution environment.
## Abstract
This checkpoint is the SFT model selected for our final AIMO3 solution. It was trained on FishMath SFT Data, a curated synthetic dataset of verified mathematical reasoning traces, including both no-tool and tool-integrated solutions.
The full solution also uses Python tool-integrated inference under a single-H100 / 5-hour evaluation budget. For the complete competition setup, see the AIMO3 solution writeup.
## Model Details
This is a language model fine-tuned from openai/gpt-oss-120b for olympiad-style mathematical reasoning. It is intended primarily for integer-answer contest problems, with final answers written in \boxed{}. The model was trained with SFT on curated mathematical reasoning trajectories, including both plain text and Python tool-integrated solutions.
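Since final answers are written in `\boxed{}`, a small parsing step is needed to score the model's outputs. The sketch below is a hypothetical helper (not part of the released inference code) that pulls the last boxed integer out of a generated solution:

```python
import re

def extract_boxed_answer(text: str):
    """Return the last integer written as \\boxed{...} in a solution, or None.

    Hypothetical helper, not the competition scoring code: contest answers
    are integers, so non-integer boxed content is skipped.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    for content in reversed(matches):
        content = content.strip().replace(",", "")
        if re.fullmatch(r"-?\d+", content):
            return int(content)
    return None
```

Taking the *last* boxed expression is a common convention for long reasoning traces, where intermediate boxed quantities may appear before the final answer.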
## Training Data
The model was trained on FishMath SFT Data, our synthetic SFT dataset for mathematical reasoning developed for the AIMO3 project. The dataset contains verified solution traces for competition-level math problems, including both plain-text and Python tool-integrated reasoning traces.
The source problems are derived mainly from public math datasets such as Nemotron-Math-v2 and Crystal-Math-Preview. See the dataset card for construction details.
## Training Summary
| Item | Value |
|---|---|
| Training trajectories used in this run | 22,287 |
| Max sequence length | 65,536 |
| Epochs | 10 |
| Global batch size | 256 |
| Effective token exposure | ~10.37B tokens |
| Optimizer | AdamW |
| Peak learning rate | 8e-6 |
| LR schedule | cosine decay |
| Warmup | 5% |
| Weight decay | 0.1 |
Tool-integrated reasoning examples were upsampled during training.
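The figures in the table above imply an average trajectory length, which is a useful sanity check against the maximum sequence length. The arithmetic below is our own back-of-the-envelope calculation, not a number reported by the card:

```python
trajectories = 22_287
epochs = 10
total_tokens = 10.37e9  # reported effective token exposure

# Sequences seen across all epochs, then average tokens per trajectory.
sequences_seen = trajectories * epochs
avg_tokens = total_tokens / sequences_seen
print(round(avg_tokens))  # roughly 46,500 tokens per trajectory
```

An average of roughly 46.5k tokens per trajectory sits comfortably under the 65,536-token maximum sequence length, consistent with long-form reasoning traces being trained without heavy truncation.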
## Evaluation
Evaluation was conducted on an internal benchmark of 285 high-difficulty integer-answer math problems. Each problem was sampled 16 times.
- pass@1: average single-sample accuracy
- Maj@8: majority-vote accuracy over 8 sampled solutions
- Pass@16: fraction of problems solved at least once across 16 samples
All numbers are percentages.
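The three metrics can be computed from per-sample correctness flags. The sketch below is our interpretation of the definitions above (tie-breaking in the majority vote is an assumption), not the competition scoring code:

```python
from collections import Counter

def eval_metrics(answers, correct_answer, k_maj=8):
    """Compute pass@1, Maj@k, and Pass@n for one problem.

    `answers` holds the final answers from n independently sampled solutions.
    """
    n = len(answers)
    correct = [a == correct_answer for a in answers]
    pass1 = sum(correct) / n                     # average single-sample accuracy
    # Majority vote over the first k_maj samples; Counter's tie-breaking
    # (first-seen wins among equal counts) is an assumption here.
    vote, _ = Counter(answers[:k_maj]).most_common(1)[0]
    maj_k = float(vote == correct_answer)
    pass_n = float(any(correct))                 # solved at least once in n samples
    return pass1, maj_k, pass_n
```

For example, with 9 of 16 samples answering 42 (the first 8 all correct) and the rest answering 7, this returns pass@1 = 0.5625, Maj@8 = 1.0, Pass@16 = 1.0.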
| Inference Mode | Model | pass@1 ↑ | Maj@8 ↑ | Pass@16 ↑ |
|---|---|---|---|---|
| TIR | Original gpt-oss-120b | 72.61 | 77.86 | 92.28 |
| TIR | FishMath🐟 SFT model | 73.38 | 78.62 | 90.88 |
| No TIR | Original gpt-oss-120b | 64.65 | 72.04 | 84.56 |
| No TIR | FishMath🐟 SFT model | 65.79 | 72.31 | 83.51 |
## Limitations
- Specialized for contest-style mathematical reasoning.
- Evaluation focuses on integer-answer problems. Proof-writing and non-math performance were not evaluated.
- Results may vary with prompts, decoding settings, hardware, serving stack, and tool environment.
- Outputs are not guaranteed to be correct.
## Model tree for SakanaAI/gpt-oss-120b-sft-aimo3-fishmath

Base model: openai/gpt-oss-120b