Motivation. Artificial Intelligence (AI) agents powered by Large Language Models (LLMs) have demonstrated advanced capabilities in automating complex tasks. With the rapid evolution of LLMs’ reasoning abilities, such agents have achieved notable success in a variety of domains, including software development and robotics. More recently, the emergence of Multimodal Foundation Models (MFMs) marks a significant advancement in artificial intelligence. These models integrate multiple sensory modalities—such as vision, language, and audio—to substantially enhance the cognitive and perceptual capabilities of AI agents. Recent breakthroughs reveal that MFMs enable fine-grained cross-modal reasoning by jointly processing visual, textual, and auditory data. These systems excel at parsing complex scenarios, generating context-aware descriptions, and tackling tasks requiring synergistic perception and language understanding—capabilities critical to robotics, human-computer interaction, and beyond. Thus, this workshop aims to explore how AI agents can be further empowered through MFMs to operate effectively in multimodal environments.

Background & Application. Recent advances have demonstrated the potential of AI agents to operate in increasingly complex, multimodal settings. For instance, OS-Copilot exhibits human-like interaction capabilities with operating systems, encompassing activities such as web browsing, programming, and engaging with third-party applications. Another prominent application area involves AI Scientist Agents, designed to automate research workflows—including experiment design, execution, and analysis. However, existing implementations of such agents are often limited to processing textual inputs, thereby overlooking the rich, complementary information available from visual and auditory modalities. Lastly, Embodied Agents represent another rapidly evolving area, where AI agents interact with the physical (or simulated) world using sensors and robotic effectors. These applications exemplify the growing need for agents that can reason across modalities to achieve human-level situational awareness and actionability.

Challenges. Multimodal reasoning introduces a distinct set of theoretical and technical challenges that extend beyond those faced in unimodal contexts. First, integrating heterogeneous data (text, images, video) demands innovations in model architectures, training paradigms, and evaluation frameworks. The increased computational demands associated with processing multimodal inputs also raise concerns regarding scalability, efficiency, and real-time performance.

Second, applying these models to frontier domains—such as multi-agent collaboration, scientific discovery, and embodied intelligence systems—requires overcoming bottlenecks in cross-modal semantic understanding and knowledge transfer. Crucially, the interpretability and robustness of multimodal reasoning systems remain unresolved foundational issues, directly impacting the deployment of reliable real-world applications (e.g., scientific AI agents, bio-inspired robotics). Addressing these challenges necessitates breakthroughs in cross-modal representation alignment, dynamic attention mechanisms, and uncertainty modeling of multi-source information.

About this Workshop. All discussions under the scope of multimodal reasoning are welcome. This workshop aims to bring together researchers from various backgrounds to study the next generation of multimodal reasoning systems. To foster an inclusive space for dialogue and debate, we invite speakers and panelists from diverse backgrounds and areas of expertise. Our roster includes both renowned researchers and emerging investigators who have driven promising advances in the field.

Call for Papers

This workshop primarily focuses on the advancement of MFM-based Agents from the perspective of enhanced multimodal perception and reasoning. To foster an inclusive environment for discussion and debate, we welcome speakers and panelists from diverse backgrounds and expertise. Our lineup features distinguished researchers alongside emerging investigators who have made significant contributions to the field. Spotlight and poster sessions will highlight new ideas, key challenges, and retrospective insights related to the workshop’s themes.

Relevant topics include, but are not limited to:

  • How can we achieve semantically consistent alignment across vision, language, and audio modalities?
  • What novel training paradigms can mitigate the computational burden of multimodal systems?
  • What novel model architectures are effective, native multimodal reasoners?
  • How do we quantify and enhance the causal reasoning capabilities of multimodal foundation models?
  • Can we develop unified metrics for evaluating cross-modal reasoning in open-ended scenarios?
  • How can we incentivize reasoning that is native to multimodal representations?
  • How can we balance reliance on different multimodal signals during inference and reasoning?
  • How can we reduce the computational demands introduced by highly redundant modalities, such as images and videos?
  • How do multimodal signals influence the behavior of MFM-based agents?

Submission: Submit to OpenReview

Submission Guidelines:

  • Paper Formatting: Papers are limited to eight pages, including figures and tables (references not included), in the CVPR style.
  • Double Blind Review: CVPR reviewing is double blind, in that authors do not know the names of the area chairs or reviewers for their papers, and the area chairs/reviewers cannot, beyond a reasonable doubt, infer the names of the authors from the submission and the additional material.

Archival Policy: This workshop is non-archival and will not result in proceedings; workshop submissions can be submitted to other venues.

Dual Submission: We welcome papers that may already have been accepted at CVPR 2026 but are also relevant to this workshop, as well as papers under review at other venues (e.g., ICML 2026).

Key Dates:

  • Paper Submission Open: February 1st, 2026, 23:59 AoE
  • Paper Submission Deadline: March 15th, 2026, 23:59 AoE
  • Acceptance Notification: TBD
  • Camera-Ready Deadline: TBD

Review Guide

Thank you for your interest in the 2nd Workshop on Multi-Modal Reasoning for Agentic Intelligence (MMRAgI) at CVPR 2026. Your expertise and dedication contribute greatly to the success of this event.

Review:

  • Confidentiality: All review assignments and the content of the papers you review should be kept confidential. Do not share these materials or discuss them with others unless they are also reviewers for the same paper.
  • Conflict of Interest: If you recognize a conflict of interest with any paper you are assigned to review, please notify the program chairs immediately.
  • Length Requirement: We recommend that submissions stay within 8 pages (excluding references).
  • Review Criteria:
    • (1) Relevance: Does the paper align with the theme of the workshop, i.e., multimodal reasoning for agentic intelligence?
    • (2) Originality: Does the paper present new ideas or results, or does it significantly build upon previous work?
    • (3) Technical Soundness: Both position papers and methodology papers are acceptable. Is the opinion or methodology correct and properly explained? Are the claims supported by theoretical analysis or experimental results?
    • (4) Clarity: Is the paper well-written and well-structured? Is it easy for readers to understand the problem, the approach, and the results?
    • (5) Impact: If the results are applied, do they have the potential to advance multimodal reasoning and agentic intelligence?

Speakers (tentative)

Ranjay Krishna

Assistant Professor, University of Washington.

Alexander Toshev

Research Scientist and Manager, Apple ML Research.

Yunzhu Li

Assistant Professor, Columbia University.

Kristen Grauman

Professor, University of Texas at Austin.

Organization

Workshop Organizers

William Yijiang Li

Ph.D., UC San Diego.

Zhenfei Yin

Rising Star Researcher; Ph.D., University of Sydney; Postdoctoral Researcher, University of Oxford.

Lucy Shi

Ph.D., Stanford University.

Annie S. Chen

Ph.D., Stanford University.

Zixian Ma

Ph.D., University of Washington.

Mahtab Bigverdi

Ph.D., University of Washington.

Amita Kamath

Ph.D., University of Washington and University of California, Los Angeles.

Anda Raluca Epure

Ph.D., University of Oxford.

Alexander Toshev

Research Scientist and Manager, Apple ML Research.

Ranjay Krishna

Assistant Professor, University of Washington.

Philip Torr

Professor, University of Oxford.