TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
Abstract
This work addresses limitations of existing OCR by reconstructing scientific PDFs into compilable LaTeX. It introduces a new benchmark and training corpus, and shows that reinforcement learning with verifiable rewards improves structural accuracy and compilation reliability.
Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
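To make the idea of a verifiable reward concrete, the following is a minimal sketch of two structural checks on a generated LaTeX page: every reference key has a matching label, and every environment that is opened is closed in order. This is not the paper's reward implementation; the function names (`dangling_refs`, `environments_balanced`, `structural_reward`) and the equal weighting are assumptions made for illustration. It also does not run a LaTeX engine, so it says nothing about actual compilability, and it ignores `%` comments and verbatim content.

```python
import re

# Hypothetical patterns and helpers for illustration only; they are not taken from the paper.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|eqref|pageref|autoref|cref|Cref)\{([^}]*)\}")
BEGIN_END_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def dangling_refs(tex: str) -> set[str]:
    """Return reference keys that have no matching label in the same page."""
    labels = set(LABEL_RE.findall(tex))
    refs = set()
    for group in REF_RE.findall(tex):
        # Commands such as cref may carry several comma-separated keys.
        refs.update(key.strip() for key in group.split(",") if key.strip())
    return refs - labels


def environments_balanced(tex: str) -> bool:
    """Check that begin/end pairs are properly nested and all closed."""
    stack = []
    for kind, name in BEGIN_END_RE.findall(tex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def structural_reward(tex: str) -> float:
    """Fraction of checks passed (equal weights are an assumption)."""
    checks = [not dangling_refs(tex), environments_balanced(tex)]
    return sum(checks) / len(checks)


page = r"\begin{equation}\label{eq:a} x \end{equation} See \eqref{eq:a} and \ref{fig:b}."
print(dangling_refs(page))      # the key fig:b has no label on this page
print(structural_reward(page))  # one of the two checks passes
```

Because the task is page-level, a reference to a label defined on another page would be flagged as dangling by this sketch; a real check would need some policy for cross-page references.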
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Multimodal OCR: Parse Anything from Documents (2026)
- Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models (2026)
- Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation (2026)
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training (2026)
- Qianfan-OCR: A Unified End-to-End Model for Document Intelligence (2026)
- TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering (2026)
- ParseBench: A Document Parsing Benchmark for AI Agents (2026)