Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models • Paper • 2603.01571 • Published 8 days ago • 32
RubricBench: Aligning Model-Generated Rubrics with Human Standards • Paper • 2603.01562 • Published 8 days ago • 55
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation • Paper • 2602.16990 • Published 20 days ago • 11
Ebisu: Benchmarking Large Language Models in Japanese Finance • Paper • 2602.01479 • Published Feb 1 • 17