MMMU-Pro
A rigorous multimodal AI benchmark with college-level questions across six disciplines that tests whether models truly understand visual and textual information together.
MMMU-Pro (Massive Multi-discipline Multimodal Understanding and Reasoning - Pro) is a hardened version of the MMMU benchmark designed to rigorously evaluate multimodal AI models on tasks requiring both visual understanding and expert-level reasoning. It covers six core disciplines: Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, and Technology and Engineering, using questions drawn from college exams, quizzes, and textbooks.
MMMU-Pro improves on its predecessor through three key steps. First, it filters out questions that text-only models can answer correctly, ensuring that visual understanding is genuinely required. Second, it increases the number of answer options from 4 to 10, lowering the random-guess baseline from 25% to 10%. Third, it introduces a vision-only input setting in which the question text is embedded within the image itself, testing whether models can extract and reason about information presented purely visually.
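To make the second and third changes concrete, here is a minimal sketch of loading both input settings, assuming the dataset is published on the Hugging Face Hub under MMMU/MMMU_Pro with "standard (10 options)" and "vision" configurations; the configuration names and split are assumptions based on common Hub conventions, not confirmed here.

```python
# Minimal sketch of loading the two MMMU-Pro input settings.
# Assumption: the dataset lives on the Hugging Face Hub as "MMMU/MMMU_Pro"
# with "standard (10 options)" and "vision" configs and a "test" split;
# the exact identifiers may differ from the published dataset.
from datasets import load_dataset

# Standard setting: question text, up to 10 answer options, and the images.
standard = load_dataset("MMMU/MMMU_Pro", "standard (10 options)", split="test")

# Vision-only setting: the question and options are rendered inside the
# image itself, so the model must read them visually.
vision = load_dataset("MMMU/MMMU_Pro", "vision", split="test")

# Inspect the available fields of one record (e.g., question, options, image).
print(sorted(standard[0].keys()))
```

In the vision setting, any text the model uses has to come from its own visual reading of the image rather than from a separate question field, which is precisely what the setting is designed to test.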
These changes make MMMU-Pro substantially harder than MMMU: model accuracy typically falls by roughly 17 to 27 percentage points relative to the original benchmark. Chain-of-thought prompting generally helps, while OCR-based prompting strategies yield minimal improvement. MMMU-Pro has become a standard benchmark for evaluating frontier multimodal models, providing a more realistic assessment of their ability to handle complex visual reasoning tasks that mirror real-world academic and professional scenarios.
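As an illustration of the prompting strategies mentioned above, the sketch below contrasts a direct-answer prompt with a chain-of-thought prompt for a multiple-choice question and includes a simple answer extractor; the prompt wording and the regex are illustrative assumptions, not the benchmark's official templates.

```python
# Sketch of direct vs. chain-of-thought prompting for a 10-option question.
# The prompt wording and extraction pattern are illustrative assumptions,
# not MMMU-Pro's official evaluation templates.
import re
import string

def build_prompt(question: str, options: list[str], cot: bool = True) -> str:
    letters = string.ascii_uppercase[: len(options)]  # A..J for 10 options
    lines = [question] + [f"({l}) {o}" for l, o in zip(letters, options)]
    if cot:
        lines.append("Think step by step, then state your final answer "
                     "as 'Answer: <letter>'.")
    else:
        lines.append("Answer with the option letter only.")
    return "\n".join(lines)

def extract_answer(response: str) -> str | None:
    # Prefer an explicit 'Answer: X' pattern; fall back to the last
    # standalone option letter mentioned in the response.
    m = re.search(r"Answer:\s*\(?([A-J])\)?", response)
    if m:
        return m.group(1)
    letters = re.findall(r"\b([A-J])\b", response)
    return letters[-1] if letters else None

print(build_prompt("Which pigment absorbs red light?",
                   ["Chlorophyll a", "Carotene"], cot=True))
print(extract_answer("...so the final choice is clear. Answer: A"))
```

Robust extraction of the final letter matters in practice, since chain-of-thought responses bury the answer inside free-form reasoning, while the direct setting expects a bare letter.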