gpt-oss-120b-sft-aimo3-FishMath🐟

Supervised fine-tuned checkpoint of openai/gpt-oss-120b for olympiad-style mathematical reasoning.

This model was developed for the Kaggle AI Mathematical Olympiad - Progress Prize 3 (AIMO3). It is intended for integer-answer contest math and can be used with plain-text reasoning or with tool-integrated reasoning in a Python execution environment.
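Tool-integrated reasoning needs some way to run the Python code the model emits and hand the output back to it. The following is a minimal sketch of that execution step only, assuming a plain subprocess with a timeout. The helper name `run_python_tool` and the 10-second default are illustrative and are not taken from the AIMO3 solution. A bare subprocess is not a security sandbox.

```python
import subprocess
import sys


def run_python_tool(code: str, timeout_s: float = 10.0) -> str:
    """Run a code snippet in a fresh Python subprocess and return its text output.

    Hypothetical helper for illustration; not the authors' tool harness.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as text so the caller can pass it back to the model.
        return f"[timeout after {timeout_s} s]"
    if result.returncode == 0:
        return result.stdout
    # On failure, include stderr so the traceback is visible to the model.
    return result.stdout + result.stderr
```

In a full loop, the returned string would be appended to the conversation as the tool result before generation continues. How that message is formatted depends on the serving stack and is not shown here.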

Abstract

This checkpoint is the SFT model selected for our final AIMO3 solution. It was trained on FishMath SFT Data, a curated synthetic dataset of verified mathematical reasoning traces, including both no-tool and tool-integrated solutions.

The full solution also uses Python tool-integrated inference under a single-H100 / 5-hour evaluation budget. For the complete competition setup, see the AIMO3 solution writeup.

Model Details

This is a language model fine-tuned from openai/gpt-oss-120b for olympiad-style mathematical reasoning. It is intended primarily for integer-answer contest problems, with final answers written in \boxed{}. The model was trained with SFT on curated mathematical reasoning trajectories, including both plain-text and Python tool-integrated solutions.
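Because final answers are written in \boxed{}, downstream code usually has to pull the integer out of the generated text. This is a minimal sketch, assuming the answer is a plain integer with no thousands separators or nested LaTeX. The function name `extract_boxed_integer` is hypothetical.

```python
import re
from typing import Optional


def extract_boxed_integer(text: str) -> Optional[int]:
    r"""Return the integer inside the last \boxed{...} in text, or None if absent.

    Assumption: the boxed content is a bare (optionally negative) integer.
    """
    matches = re.findall(r"\\boxed\{\s*(-?\d+)\s*\}", text)
    if not matches:
        return None
    # Use the last match, since earlier boxes may be intermediate results.
    return int(matches[-1])
```

Taking the last match is a design choice for this sketch. A stricter parser might reject outputs that contain more than one distinct boxed value.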

Training Data

The model was trained on FishMath SFT Data, our synthetic SFT dataset for mathematical reasoning developed for the AIMO3 project. The dataset contains verified solution traces for competition-level math problems, including both plain-text and Python tool-integrated reasoning traces.

The source problems are derived mainly from public math datasets such as Nemotron-Math-v2 and Crystal-Math-Preview. See the dataset card for construction details.

Training Summary

Item                                     Value
Training trajectories used in this run   22,287
Max sequence length                      65,536
Epochs                                   10
Global batch size                        256
Effective token exposure                 ~10.37B tokens
Optimizer                                AdamW
Peak learning rate                       8e-6
LR schedule                              cosine decay
Warmup                                   5%
Weight decay                             0.1

Tool-integrated reasoning examples were upsampled during training.
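The learning-rate settings in the summary (peak 8e-6, 5% warmup, cosine decay) can be sketched as a function of the optimizer step. The summary does not state the warmup shape, the final learning rate, or the total step count, so the linear warmup, the `min_lr` default of 0, and the `total_steps` argument below are all assumptions.

```python
import math

PEAK_LR = 8e-6          # from the training summary
WARMUP_FRACTION = 0.05  # from the training summary


def learning_rate(step: int, total_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay.

    Assumptions: linear warmup shape and decay down to min_lr.
    """
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches the peak.
        return PEAK_LR * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (PEAK_LR - min_lr) * cosine
```

For example, `learning_rate(step, total_steps=1000)` rises to the peak during the first 50 steps and then decays toward `min_lr`. The value 1000 is purely illustrative.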

Evaluation

Evaluation was conducted on an internal benchmark of 285 high-difficulty integer-answer math problems. Each problem was sampled 16 times.

  • pass@1: average single-sample accuracy
  • Maj@8: majority-vote accuracy over 8 sampled solutions
  • Pass@16: fraction of problems solved at least once across 16 samples
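The three metrics above can be sketched from per-problem lists of sampled answers. This is a sketch under assumptions, not the evaluation code used for the reported numbers. In particular, it computes the majority vote over the first k samples of each problem and breaks ties in favor of the answer seen first; how the 8 samples were chosen from the 16 is not stated here. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Samples = Dict[str, List[Optional[int]]]  # problem id -> sampled answers (None = no answer)
Truth = Dict[str, int]                    # problem id -> reference answer


def pass_at_1(samples: Samples, truth: Truth) -> float:
    """Average single-sample accuracy, averaged over problems."""
    per_problem = [
        sum(a == truth[pid] for a in answers) / len(answers)
        for pid, answers in samples.items()
    ]
    return sum(per_problem) / len(per_problem)


def maj_at_k(samples: Samples, truth: Truth, k: int = 8) -> float:
    """Majority-vote accuracy over the first k samples (assumed selection rule)."""
    correct = 0
    for pid, answers in samples.items():
        votes = Counter(a for a in answers[:k] if a is not None)
        # most_common breaks ties by first occurrence; no votes counts as wrong.
        if votes and votes.most_common(1)[0][0] == truth[pid]:
            correct += 1
    return correct / len(samples)


def pass_at_n(samples: Samples, truth: Truth) -> float:
    """Fraction of problems with at least one correct sample."""
    solved = sum(truth[pid] in answers for pid, answers in samples.items())
    return solved / len(samples)
```

The functions return fractions in [0, 1]; multiply by 100 to compare with percentage figures.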

All numbers are percentages. TIR denotes tool-integrated reasoning; No TIR denotes plain-text reasoning without tools.

Inference Mode   Model                    pass@1 ↑   Maj@8 ↑   Pass@16 ↑
TIR              Original gpt-oss-120b    72.61      77.86     92.28
TIR              FishMath🐟 SFT model     73.38      78.62     90.88
No TIR           Original gpt-oss-120b    64.65      72.04     84.56
No TIR           FishMath🐟 SFT model     65.79      72.31     83.51

Limitations

  • Specialized for contest-style mathematical reasoning.
  • Evaluation focuses on integer-answer problems. Proof-writing and non-math performance were not evaluated.
  • Results may vary with prompts, decoding settings, hardware, serving stack, and tool environment.
  • Outputs are not guaranteed to be correct.

Model size: 117B params (Safetensors; tensor types BF16 and U8)