Researchers at MiroMind AI and several Chinese universities have released OpenMMReasoner, a new training framework that improves the capabilities of language models in multimodal reasoning.
The framework uses a two-stage process. It first refines a base model with a curated dataset in a supervised fine-tuning (SFT) stage. Then, a reinforcement learning (RL) stage guides the model to reason more effectively in tasks that involve both text and visual data.
Experiments show that models trained with OpenMMReasoner outperform other leading visual reasoning models, often while being trained on a smaller, higher-quality dataset. The framework and all its assets, including a trained 7B model, are fully open source, providing a reliable foundation for building applications that require traceability and robustness.
According to Kaichen Zhang, co-author of a research paper that outlines the new method, OpenMMReasoner offers significant benefits for businesses looking beyond large, closed systems. "A smaller open-source reasoning model has practical advantages: Enterprises can deploy it locally, reduce latency, lower token costs associated with long chains of thought, maintain full control over their data and [it is] fine-tunable to adapt to their specific downstream task," he told VentureBeat.
The challenge of transparent multimodal reasoning
Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the reasoning abilities of large language models (LLMs). RLVR trains LLMs to generate chain-of-thought (CoT) tokens (which mimic the reasoning processes humans use) before generating the final answer. This improves the model’s capability to solve complex reasoning tasks such as math and coding.
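To make the "verifiable reward" idea concrete, here is a minimal sketch: the model emits its chain of thought first, and a checker grants reward only if the final answer matches the ground truth. The `Answer:` parsing convention and function name are illustrative assumptions, not code from the paper.

```python
# Toy verifiable reward for RLVR-style training: reward depends only on
# whether the final answer (after the chain of thought) matches ground truth.
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    # Assume the model ends its output with a line "Answer: <value>".
    match = re.search(r"Answer:\s*(.+)\s*$", model_output.strip())
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

out = "Let's reason step by step... 12 * 4 = 48.\nAnswer: 48"
print(verifiable_reward(out, "48"))  # prints 1.0
```

Because the reward checks only the verifiable final answer, the intermediate reasoning tokens are free-form, which is what lets RL shape them.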
Motivated by this success, researchers have applied similar RL-based methods to large multimodal models (LMMs), showing that the benefits can extend beyond text to improve visual understanding and problem-solving across different modalities.
However, a lack of transparency in the training pipeline has been a major barrier. Many studies on multimodal reasoning do not provide detailed information about their data curation and training processes, making it difficult to reproduce their results or understand what makes these models work.
“This lack of openness restricts reproducibility and obscures a deeper understanding of how reasoning-capable LMMs are actually built and how their training dynamics evolve,” the researchers note.
The OpenMMReasoner recipe
OpenMMReasoner addresses this gap with a fully transparent and scalable training recipe built on open-source LMMs. The researchers found that curating high-quality data by scaling its diversity was critical. While drawing on diverse data sources matters, increasing the diversity of correct answers to the same question proved to be an essential axis of improvement.
The first stage of the recipe is a three-step supervised fine-tuning (SFT) pipeline. It begins with data sourcing, where the team collected approximately 103,000 raw question-answer pairs from public datasets covering general visual Q&A and reasoning tasks. Next comes a data distillation step, in which a powerful teacher model (Qwen3-VL-235B-Instruct) generates new, high-quality reasoning traces for selected questions; these traces are then used to train the smaller target model.
To increase answer diversity, the team generated multiple verified reasoning traces for each question. This expanded the dataset to 583,000 samples. Finally, they implemented a “domain mixing” phase, adding data from mathematical reasoning domains to further generalize the model’s capabilities, resulting in a final SFT dataset of 874,000 examples.
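The expansion steps above can be sketched in a few lines: sample several candidate traces per question from a teacher model, keep only those that reach the correct answer, then mix in examples from another domain. The function names and the `teacher(question) -> (trace, answer)` interface are hypothetical stand-ins for the paper's pipeline.

```python
# Illustrative sketch of the SFT data-expansion idea: verified answer-diversity
# expansion via distillation, followed by domain mixing.
import random

def distill_traces(question, answer, teacher, k=8):
    """Ask the teacher for k reasoning traces; keep those reaching the right answer."""
    kept = []
    for _ in range(k):
        trace, predicted = teacher(question)   # (reasoning text, final answer)
        if predicted == answer:                # verification step
            kept.append({"question": question, "trace": trace, "answer": answer})
    return kept

def build_sft_set(qa_pairs, teacher, extra_domain_examples, k=8):
    dataset = []
    for question, answer in qa_pairs:
        # Answer-diversity expansion: multiple verified traces per question.
        dataset.extend(distill_traces(question, answer, teacher, k))
    dataset.extend(extra_domain_examples)      # domain mixing (e.g. math data)
    random.shuffle(dataset)
    return dataset
```

The verification filter is what allows the dataset to grow several-fold (103K questions to 583K samples in the paper) without admitting incorrect reasoning.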
The second stage is an RL recipe that uses a smaller, 74,000-sample dataset curated from domains like science, math and puzzles. The model is trained with a composite reward function that considers both the correctness of the final answer and the consistency of the output format. To improve efficiency, the process includes a penalty for "overthinking," discouraging the model from generating excessively long answers, a common failure mode of RL-trained reasoning models that drives up cost and slows responses.
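A composite reward of this shape can be sketched as follows. The `\boxed{...}` answer format, the weights, and the token budget are illustrative assumptions, not the paper's actual values.

```python
# Hedged sketch of a composite RL reward: answer correctness + format
# consistency - an "overthinking" penalty for output beyond a token budget.
import re

def composite_reward(output: str, ground_truth: str,
                     max_tokens: int = 1024, length_weight: float = 0.2) -> float:
    # Format check: require the final answer wrapped in \boxed{...}.
    m = re.search(r"\\boxed\{([^}]*)\}", output)
    format_ok = 1.0 if m else 0.0
    correct = 1.0 if (m and m.group(1).strip() == ground_truth.strip()) else 0.0
    # Overthinking penalty: whitespace split as a crude token-count proxy.
    n_tokens = len(output.split())
    overshoot = max(0, n_tokens - max_tokens) / max_tokens
    return correct + 0.1 * format_ok - length_weight * overshoot
```

Because the penalty only activates past the budget, short correct answers keep the full reward while rambling ones are gradually discounted.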
This recipe can provide a blueprint for enterprises training their own models. "For companies with limited domain-specific data, a feasible strategy is to first increase answer diversity for their existing dataset, then use domain mixing to integrate this domain data into a general reasoning recipe like ours," Zhang explained. "This allows the model to acquire strong general-purpose reasoning skills while also adapting to industry-specific tasks, without needing millions of samples."
A more efficient and capable reasoning model
According to Zhang, the step-by-step process fundamentally changes the reliability of the model’s outputs. "Traditional models often 'jump' directly to an answer, which means they explore only a narrow portion of the reasoning space," he said. "In contrast, a reasoning-first approach forces the model to explicitly examine multiple intermediate steps… [allowing it] to traverse much deeper paths and arrive at answers with far more internal consistency."
The researchers used the OpenMMReasoner recipe to generate data to fine-tune the Qwen2.5-VL-7B-Instruct open-source vision-language model. The result is a highly capable LMM that consistently outperforms state-of-the-art methods, such as Open Vision Reasoner (OVR), across a wide range of multimodal reasoning benchmarks. The SFT stage alone creates a strong baseline model that achieves superior performance and data efficiency compared to other SFT approaches, despite using a significantly smaller training dataset.
The subsequent RL phase further sharpens and stabilizes these abilities, leading to more consistent and improved performance. After RL, the final model achieves state-of-the-art results on several benchmarks, including WeMath, MathVerse and MathVista.
One of the key findings was that, as the model improved at multimodal reasoning, it also showed a "gradual emergence of textual reasoning behaviors, suggesting a transfer of reasoning competence from multimodal to purely linguistic domains," the researchers note. This indicates that skills learned in one modality can strengthen performance in another.
"Our results show that strengthening multimodal reasoning can even improve text-only mathematical skills—evidence that core logical abilities can transfer across modalities," Zhang said. "Looking ahead, we do expect these methods to extend to video and audio."
The researchers also found that token efficiency is crucial. While allowing a model to generate longer reasoning steps can improve performance, excessive tokens reduce efficiency. Their results show that setting a smaller "reasoning budget" can achieve comparable or even better accuracy, an important consideration for deploying cost-effective enterprise applications.
By open-sourcing all components of their workflow, the researchers provide a reproducible view of the entire process. For enterprise teams, this transparency is invaluable. "For business leaders concerned about vendor lock-in, hidden biases or opaque data sources, this level of transparency is essential," Zhang stated. "It empowers teams to validate the data, customize the pipeline for new domains and maintain long-term independence from any single provider."