TL;DR: FlashWorld generates high-quality 3D scenes of diverse kinds from a single image or text prompt, taking 7 seconds on a single A100/A800 GPU or 4 seconds on a single H100/H800 GPU.
Previous research has applied Multimodal Large Language Models (MLLMs) to 3D scene understanding by interpreting scenes as videos. These approaches generally depend on ...