# gpt-oss-120b-sft-aimo3-FishMath🐟
Supervised fine-tuned checkpoint of openai/gpt-oss-120b for olympiad-style mathematical reasoning.
This model was developed as part of our work for Kaggle AI Mathematical Olympiad - Progress Prize 3. It is intended for integer-answer contest math and can be used with plain text reasoning or tool-integrated reasoning using a Python execution environment.
## Abstract
This checkpoint is the SFT model selected for our final AIMO3 solution. It was trained on FishMath SFT Data, a curated synthetic dataset of verified mathematical reasoning traces, including both no-tool and tool-integrated solutions.
The full solution also uses Python tool-integrated inference under a single-H100 / 5-hour evaluation budget. For the complete competition setup, see the AIMO3 solution writeup.
## Model Details
This is a language model fine-tuned from openai/gpt-oss-120b for olympiad-style mathematical reasoning. It is intended primarily for integer-answer contest problems, with final answers written in \boxed{}. The model was trained with SFT on curated mathematical reasoning trajectories, including both plain text and Python tool-integrated solutions.
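Since final answers are written in `\boxed{}`, a small parsing step is needed to score the model's outputs. The sketch below is a hypothetical helper (not part of the released inference code) that pulls the last boxed integer out of a generated solution:

```python
import re

def extract_boxed_answer(text: str):
    """Return the last integer written as \\boxed{...} in a solution, or None.

    Hypothetical helper, not the competition scoring code: contest answers
    are integers, so non-integer boxed content is skipped.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    for content in reversed(matches):
        content = content.strip().replace(",", "")
        if re.fullmatch(r"-?\d+", content):
            return int(content)
    return None
```

Taking the *last* boxed expression is a common convention for long reasoning traces, where intermediate boxed quantities may appear before the final answer.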
## Training Data
The model was trained on FishMath SFT Data, our synthetic SFT dataset for mathematical reasoning developed for the AIMO3 project. The dataset contains verified solution traces for competition-level math problems, including both plain-text and Python tool-integrated reasoning traces.
The source problems are derived mainly from public math datasets such as Nemotron-Math-v2 and Crystal-Math-Preview. See the dataset card for construction details.
## Training Summary
| Item | Value |
|---|---|
| Training trajectories used in this run | 22,287 |
| Max sequence length | 65,536 |
| Epochs | 10 |
| Global batch size | 256 |
| Effective token exposure | ~10.37B tokens |
| Optimizer | AdamW |
| Peak learning rate | 8e-6 |
| LR schedule | cosine decay |
| Warmup | 5% |
| Weight decay | 0.1 |
Tool-integrated reasoning examples were upsampled during training.
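The figures in the table above imply an average trajectory length, which is a useful sanity check against the maximum sequence length. The arithmetic below is our own back-of-the-envelope calculation, not a number reported by the card:

```python
trajectories = 22_287
epochs = 10
total_tokens = 10.37e9  # reported effective token exposure

# Sequences seen across all epochs, then average tokens per trajectory.
sequences_seen = trajectories * epochs
avg_tokens = total_tokens / sequences_seen
print(round(avg_tokens))  # roughly 46,500 tokens per trajectory
```

An average of roughly 46.5k tokens per trajectory sits comfortably under the 65,536-token maximum sequence length, consistent with long-form reasoning traces being trained without heavy truncation.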
## Evaluation
Evaluation was conducted on an internal benchmark of 285 high-difficulty integer-answer math problems. Each problem was sampled 16 times.
- pass@1: average single-sample accuracy
- Maj@8: majority-vote accuracy over 8 sampled solutions
- Pass@16: fraction of problems solved at least once across 16 samples
All numbers are percentages.
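The three metrics can be computed from per-sample correctness flags. The sketch below is our interpretation of the definitions above (tie-breaking in the majority vote is an assumption), not the competition scoring code:

```python
from collections import Counter

def eval_metrics(answers, correct_answer, k_maj=8):
    """Compute pass@1, Maj@k, and Pass@n for one problem.

    `answers` holds the final answers from n independently sampled solutions.
    """
    n = len(answers)
    correct = [a == correct_answer for a in answers]
    pass1 = sum(correct) / n                     # average single-sample accuracy
    # Majority vote over the first k_maj samples; Counter's tie-breaking
    # (first-seen wins among equal counts) is an assumption here.
    vote, _ = Counter(answers[:k_maj]).most_common(1)[0]
    maj_k = float(vote == correct_answer)
    pass_n = float(any(correct))                 # solved at least once in n samples
    return pass1, maj_k, pass_n
```

For example, with 9 of 16 samples answering 42 (the first 8 all correct) and the rest answering 7, this returns pass@1 = 0.5625, Maj@8 = 1.0, Pass@16 = 1.0.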
| Inference Mode | Model | pass@1 ↑ | Maj@8 ↑ | Pass@16 ↑ |
|---|---|---|---|---|
| TIR | Original gpt-oss-120b | 72.61 | 77.86 | 92.28 |
| TIR | FishMath🐟 SFT model | 73.38 | 78.62 | 90.88 |
| No TIR | Original gpt-oss-120b | 64.65 | 72.04 | 84.56 |
| No TIR | FishMath🐟 SFT model | 65.79 | 72.31 | 83.51 |
## Limitations
- Specialized for contest-style mathematical reasoning.
- Evaluation focuses on integer-answer problems. Proof-writing and non-math performance were not evaluated.
- Results may vary with prompts, decoding settings, hardware, serving stack, and tool environment.
- Outputs are not guaranteed to be correct.
## Model tree for SakanaAI/gpt-oss-120b-sft-aimo3-fishmath

Base model: openai/gpt-oss-120b