Developer launches Emu3 multimodal model unifying video, image, text

The Beijing Academy of Artificial Intelligence (BAAI) on Monday released Emu3, a multimodal world model that unifies the understanding and generation of text, image and video modalities through next-token prediction.

Emu3 demonstrates that next-token prediction can serve as a powerful paradigm for multimodal models, scaling beyond language models and delivering state-of-the-art performance across multimodal tasks, Wang Zhongyuan, director of BAAI, said in a press release.

"By tokenizing images, text and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences," Wang said, adding that Emu3 eliminates the need for diffusion or compositional approaches entirely.
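As a rough illustration of the idea Wang describes, the sketch below places text tokens and image tokens in one shared discrete vocabulary and then builds the shifted input and target sequences used in next-token prediction. The vocabulary sizes, the marker tokens and the function names are hypothetical and chosen only for this example; they are not taken from Emu3's released code.

```python
# Hypothetical vocabulary sizes for illustration; not Emu3's actual values.
TEXT_VOCAB = 1000      # ids 0..999 are text tokens
VISION_VOCAB = 4096    # ids 1000..5095 are image codebook entries
BOI = TEXT_VOCAB + VISION_VOCAB  # assumed "begin of image" marker (5096)
EOI = BOI + 1                    # assumed "end of image" marker (5097)


def to_unified(text_ids, image_codes):
    """Map text ids and image codebook indices into one shared id space."""
    # Offset image codes so they never collide with text ids.
    vision_ids = [TEXT_VOCAB + code for code in image_codes]
    return list(text_ids) + [BOI] + vision_ids + [EOI]


def next_token_pairs(sequence):
    """Return (inputs, targets), where targets are inputs shifted by one."""
    return sequence[:-1], sequence[1:]


seq = to_unified(text_ids=[5, 42, 7], image_codes=[0, 17, 4095])
inputs, targets = next_token_pairs(seq)

# Each position is trained to predict the token that follows it,
# regardless of whether that token came from text or from an image.
for current, following in zip(inputs, targets):
    print(current, "->", following)
```

In this sketch a single sequence model would see only integer ids, so the same prediction objective covers both modalities. Video could be handled the same way by appending the tokens of successive frames.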

According to BAAI, Emu3 outperforms several well-established task-specific models in both generation and perception tasks. The organization has open-sourced the key technologies and models of Emu3 to the international technology community.

Technology practitioners have noted that the release opens a new opportunity to explore multimodality through a unified architecture, without the need to combine complex diffusion models with large language models.

"In the future, the multimodal world model will promote scenario applications such as robot brains, autonomous driving, multimodal dialogue and inference," Wang said.

Source(s): Xinhua News Agency