arxiv:2604.22880

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Published on Apr 24

· Submitted by

chengyewang on Apr 28

Upvote

Authors:

Chengye Wang ,

Abstract

Research addresses limitations of existing OCR by focusing on reconstructing scientific PDFs into compilable LaTeX through a new benchmark and training corpus, demonstrating improved structural accuracy and compilation reliability using reinforcement learning with verifiable rewards.

AI-generated summary

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.