Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine


Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction–image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law–constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio–visual consistency filtering to generate high-fidelity video–instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.


💡 Research Summary

OmniFysics tackles a fundamental weakness of current omni‑modal large language models (OLMs): their brittle physical understanding. While models such as GPT‑4o and Gemini 3 Pro excel at semantic alignment across text, images, audio, and video, they often generate “physical hallucinations” because key physical attributes (density, elasticity, friction, etc.) are visually ambiguous and sparsely represented in web‑scale data. To close this gap, the authors introduce a compact omni‑modal model, OmniFysics, together with a dedicated physical data engine that automatically creates high‑quality, physics‑grounded training data at scale.

The data engine consists of two pipelines. FysicsAny handles static objects. It follows a five‑stage process: (1) hybrid sampling of image collections, (2) visual‑language mapping to extract salient objects, (3) hierarchical retrieval of candidate physical attributes from a curated prototype database using Qwen‑3 embeddings, (4) physics‑law‑constrained verification (e.g., restitution ∈ [0, 1]), and (5) caption rewriting that folds the verified attributes into the final physics‑grounded instruction–image pairs; a sketch of the retrieval and verification stages follows.
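To make stages (3) and (4) concrete, here is a minimal Python sketch of the retrieve‑then‑verify pattern. The prototype entries, attribute names, and embedding placeholders are illustrative assumptions rather than the paper's actual schema; only the physical constraints themselves (a restitution coefficient in [0, 1], positive density, non‑negative friction) are standard.

```python
import numpy as np

# Hypothetical prototype database: object name -> embedding + verified attributes.
# Entries and values are illustrative; random vectors stand in for Qwen-3 embeddings.
PROTOTYPES = {
    "rubber ball": {
        "embedding": np.random.rand(1024),
        "attributes": {"density_kg_m3": 1100.0, "restitution": 0.8, "friction": 0.9},
    },
    "steel bearing": {
        "embedding": np.random.rand(1024),
        "attributes": {"density_kg_m3": 7850.0, "restitution": 0.6, "friction": 0.4},
    },
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_attributes(object_emb: np.ndarray, top_k: int = 1) -> list[dict]:
    """Stage (3), sketched: rank prototypes by embedding similarity to the object."""
    ranked = sorted(PROTOTYPES.values(),
                    key=lambda p: cosine(object_emb, p["embedding"]),
                    reverse=True)
    return [p["attributes"] for p in ranked[:top_k]]

def verify_physics(attrs: dict) -> bool:
    """Stage (4), sketched: reject attribute sets that violate basic physical laws."""
    return (0.0 <= attrs["restitution"] <= 1.0   # passive collisions cannot gain energy
            and attrs["density_kg_m3"] > 0.0     # density must be positive
            and attrs["friction"] >= 0.0)        # friction coefficients are non-negative
```

Under this reading, an attribute set retrieved for a salient object survives only if it passes `verify_physics`, after which the caption is rewritten to state the verified attributes explicitly.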
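FysicsOmniCap, the second pipeline, handles dynamic content: it distills web videos through audio–visual consistency filtering to produce high‑fidelity video–instruction pairs that emphasize cross‑modal physical cues (e.g., an impact sound consistent with the material seen on screen). The filter can plausibly be read as a thresholded cross‑modal similarity check; the sketch below assumes precomputed clip embeddings and an arbitrary threshold, neither of which comes from the paper.

```python
import numpy as np

def audio_visual_consistent(video_emb: np.ndarray, audio_emb: np.ndarray,
                            threshold: float = 0.35) -> bool:
    """Keep a clip only if its audio and visual embeddings agree.
    The embedding source and threshold are assumptions for illustration."""
    sim = float(video_emb @ audio_emb /
                (np.linalg.norm(video_emb) * np.linalg.norm(audio_emb) + 1e-8))
    return sim >= threshold

def filter_clips(clips: list[dict]) -> list[dict]:
    """Distill a web-video pool down to clips with consistent cross-modal cues."""
    return [c for c in clips if audio_visual_consistent(c["video_emb"], c["audio_emb"])]
```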
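On the modeling side, the abstract describes staged multimodal alignment and instruction tuning, latent‑space flow matching for text‑to‑image generation, and an intent router that activates generation only when a request calls for it. The snippet below sketches one conditional flow‑matching training step on image latents under the common linear (rectified‑flow) interpolation path; the paper may use a different parameterization, and `model` is a hypothetical velocity predictor.

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One conditional flow-matching step on image latents (linear-path sketch).
    `model` predicts the velocity field v_theta(x_t, t, cond); names are assumed."""
    x1 = torch.randn_like(x0)                       # noise endpoint of the path
    t = torch.rand(x0.size(0), device=x0.device)    # per-sample time in [0, 1)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))        # broadcast t over latent dims
    xt = (1.0 - t_) * x0 + t_ * x1                  # interpolate data -> noise
    target_v = x1 - x0                              # constant velocity along the path
    pred_v = model(xt, t, cond)
    return torch.mean((pred_v - target_v) ** 2)
```

The intent router itself is best read as a lightweight gate that invokes the generation head only when the routed intent calls for image or speech output, so pure understanding queries skip generation entirely.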


Comments & Academic Discussion

Loading comments...

Leave a Comment