Meta's New AI Tech Can Edit Images with Text and Convert Them to Video

November 16, 2023 • 2 min read

Meta AI has unveiled two research milestones that highlight its continued progress in generative AI: Emu Video for text-to-video generation, and Emu Edit for precise text-guided image editing.

Emu Video: High-Quality Text-to-Video Generation

Emu Video builds on Meta's Emu image generation model to achieve state-of-the-art results in text-to-video generation. Its key innovation is a "factorized" approach that splits text-to-video generation into two steps: first generating an image from the text prompt, and then generating a video conditioned on both the image and the text.

This approach provides stronger conditioning signals to the model compared to prior text-only methods. Emu Video uses diffusion models and identifies critical adjustments like tailored noise schedules and multi-stage training to generate high-resolution 512x512 videos directly.
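To make the two-step factorization concrete, here is a minimal toy sketch in Python. It is not Meta's code: the functions generate_image, generate_video, text_to_video, and animate_image are hypothetical stand-ins, and simple random pixel drift takes the place of the actual diffusion models. Using the conditioning image as the first frame is likewise an assumption made for illustration. The only point is the data flow: text produces an image, and then the image plus the same text produce the video.

```python
import random
from typing import List

Frame = List[List[float]]  # toy grayscale image: rows of pixel values in [0, 1]


def generate_image(prompt: str, size: int = 4) -> Frame:
    """Step 1 stand-in (hypothetical): text -> image."""
    rng = random.Random(prompt)  # seed from the prompt so the toy output is repeatable
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def generate_video(image: Frame, prompt: str, num_frames: int = 8) -> List[Frame]:
    """Step 2 stand-in (hypothetical): (image, text) -> video.

    Assumption for this toy: the conditioning image is used as the first
    frame, and later frames drift slightly away from it.
    """
    rng = random.Random(prompt + "|video")
    frames = [image]
    for _ in range(num_frames - 1):
        prev = frames[-1]
        # Apply a small random change to each pixel, clamped to [0, 1].
        frames.append(
            [[min(1.0, max(0.0, p + rng.uniform(-0.05, 0.05))) for p in row]
             for row in prev]
        )
    return frames


def text_to_video(prompt: str, num_frames: int = 8) -> List[Frame]:
    """Factorized pipeline: generate an image first, then condition the video on it."""
    image = generate_image(prompt)
    return generate_video(image, prompt, num_frames)


def animate_image(image: Frame, prompt: str, num_frames: int = 8) -> List[Frame]:
    """Skip step 1 and animate a user-provided image with a text prompt."""
    return generate_video(image, prompt, num_frames)
```

Because the second step takes an image as input, the same function can animate a picture the user supplies, which is why the factorized design lends itself to image animation as well as text-to-video.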

In human evaluations, Meta says that Emu Video strongly outperformed all prior text-to-video models, including Google's Imagen Video and NVIDIA's PYoCo, on both video quality and faithfulness to the text prompt. Emu Video was preferred 96% of the time over Meta's previous Make-A-Video method. The factorized approach also enables animation of user-provided images based on text prompts.

However, based on our own testing, we do not find Emu Video to be as capable as products from startups like Runway and Pika Labs.

Emu Edit: Precise Text-Guided Image Editing

Emu Edit demonstrates new capabilities in editing images purely based on textual instructions, executing both free-form and region-based edits precisely. The key to Emu Edit's precision is its training on both image editing tasks and computer vision tasks like segmentation and detection formulated as instructions.

This provides strong control signals, allowing Emu Edit to alter only the pixels relevant to an instruction and leave unrelated areas untouched. Emu Edit was trained on a dataset of 10 million samples, each a triplet of input image, instruction, and target output image. According to Meta, it significantly outperformed current instruction-based editing methods in human and automatic evaluations across diverse edit types.
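As a rough illustration of what such a triplet looks like and what "leaving unrelated areas untouched" means, here is a small Python sketch. It is an assumption-laden toy, not Meta's data format or evaluation code: the EditSample class and the changed_pixels and edit_is_local helpers are hypothetical names, and images are reduced to small grids of integers.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Image = List[List[int]]  # toy image: rows of integer pixel values


@dataclass
class EditSample:
    """One training triplet (hypothetical layout): input, instruction, target."""
    input_image: Image
    instruction: str
    target_image: Image


def changed_pixels(sample: EditSample) -> List[Tuple[int, int]]:
    """Return (row, col) positions where the target differs from the input."""
    return [
        (r, c)
        for r, row in enumerate(sample.input_image)
        for c, pixel in enumerate(row)
        if pixel != sample.target_image[r][c]
    ]


def edit_is_local(sample: EditSample, region: Set[Tuple[int, int]]) -> bool:
    """True if every changed pixel lies inside the region the instruction targets."""
    return all(position in region for position in changed_pixels(sample))


# Usage: a 3x3 image where the instruction should only touch the center pixel.
sample = EditSample(
    input_image=[[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    instruction="brighten the object in the center",
    target_image=[[0, 0, 0], [0, 9, 0], [0, 0, 0]],
)
print(changed_pixels(sample))            # [(1, 1)]
print(edit_is_local(sample, {(1, 1)}))   # True
```

A precise editor, in this simplified view, is one whose output differs from the input only inside the region the instruction refers to.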

According to Meta, these advances could enable creative applications like generating custom animated stickers and effortlessly editing personal photos. While not aiming to replace professional creatives, Meta believes the work could one day lead to more accessible creative tools, easier self-expression, and new means of visual communication for everyday users.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.
