Meta's New AI Tech Can Edit Images with Text and Convert Them to Video

Building on its prior work in image and video generation, Meta's new models showcase impressive capabilities in high-quality, diffusion-based text-to-video generation and precise, text-instructed image editing.

Image Credit: Maginative

Meta AI has unveiled two major new research milestones that highlight its continued progress in generative AI: Emu Video, for text-to-video generation, and Emu Edit, for precise text-guided image editing.

Emu Video: High-Quality Text-to-Video Generation

Emu Video leverages Meta's Emu image generation model to achieve state-of-the-art results in text-to-video generation. The key innovation of Emu Video is its "factorized" approach, which splits text-to-video generation into two steps: first generating an image from the text prompt, and then generating a video conditioned on both the image and the text.
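The two-stage pipeline can be sketched as follows. Emu Video itself is not publicly released, so the functions below are stubs that only illustrate the data flow (a text prompt produces a conditioning image, and the video stage consumes both); the function names, sizes, and placeholder outputs are assumptions, not Meta's API.

```python
import numpy as np

def generate_image(prompt: str, size: int = 512) -> np.ndarray:
    """Stage 1 (stub): text-to-image generation.
    A real diffusion model would iteratively denoise a latent
    conditioned on the prompt; here we return a deterministic
    placeholder array of the right shape."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((size, size, 3))

def generate_video(prompt: str, first_frame: np.ndarray,
                   num_frames: int = 16) -> np.ndarray:
    """Stage 2 (stub): video generation conditioned on BOTH the text
    and the stage-1 image, which anchors appearance and layout.
    Placeholder: repeats the conditioning frame; a real model would
    denoise an entire video volume."""
    return np.stack([first_frame] * num_frames, axis=0)

def factorized_text_to_video(prompt: str) -> np.ndarray:
    image = generate_image(prompt)        # text -> image
    return generate_video(prompt, image)  # (text, image) -> video

video = factorized_text_to_video("a corgi surfing a wave")
print(video.shape)  # (num_frames, height, width, channels)
```

Because the second stage receives a concrete image rather than text alone, it gets a much stronger conditioning signal, which is the core of the factorized design described above.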

This approach provides stronger conditioning signals to the model compared to prior text-only methods. Emu Video uses diffusion models and identifies critical adjustments like tailored noise schedules and multi-stage training to generate high-resolution 512×512 videos directly.
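To make the "noise schedule" idea concrete: in a diffusion model, the schedule fixes how much noise is added at each step, and the cumulative product of the per-step signal fractions determines how much of the original image survives. Emu Video's exact schedule is not public; the linear schedule below is a generic, commonly used illustration with assumed hyperparameters.

```python
import numpy as np

def linear_beta_schedule(num_steps: int = 1000,
                         beta_start: float = 1e-4,
                         beta_end: float = 2e-2) -> np.ndarray:
    """Per-step noise variances (betas). Tailoring values like these is
    the kind of 'critical adjustment' the research describes; the
    specific numbers here are illustrative defaults, not Emu Video's."""
    return np.linspace(beta_start, beta_end, num_steps)

betas = linear_beta_schedule()
# Fraction of the original signal remaining after each step.
alphas_cumprod = np.cumprod(1.0 - betas)
# Early steps keep nearly all signal; by the final step the sample
# is almost pure noise.
print(float(alphas_cumprod[0]), float(alphas_cumprod[-1]))
```

Changing the endpoints or the shape of this curve (e.g. making late steps noisier) changes what the model must learn at each stage, which is why schedule tuning matters for high-resolution generation.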

In human evaluations, Meta says that Emu Video strongly outperformed all prior text-to-video models, including Google's Imagen Video and NVIDIA's PYoCo, on both video quality and faithfulness to the text prompt. Emu Video was preferred 96% of the time over Meta's previous Make-A-Video method. The factorized approach also enables animation of user-provided images based on text prompts.

However, based on our own testing, we did not find Emu Video to be as capable as products from startups like Runway and Pika Labs.

Emu Edit: Precise Text-Guided Image Editing

Emu Edit demonstrates new capabilities in editing images purely based on textual instructions, executing both free-form and region-based edits precisely. The key to Emu Edit's precision is its training on both image editing tasks and computer vision tasks like segmentation and detection formulated as instructions.

This provides strong control signals, allowing Emu Edit to alter only the relevant pixels while leaving unrelated areas untouched, as the instruction specifies. Emu Edit was trained on a dataset of 10 million triplets, each consisting of an input image, an edit instruction, and a target output image. It significantly outperformed current instruction-based editing methods in human and automatic evaluations across diverse edit types.
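The training data described above can be sketched as a simple record type. The field names and sample values below are illustrative (Meta has not released the dataset or its schema); the second example shows the paper's key idea of phrasing a computer vision task, like segmentation, as an instruction in the same triplet format.

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    """One training example: input image, natural-language instruction,
    and target output image. Paths and text here are hypothetical."""
    input_image: str   # path to (or tensor of) the original image
    instruction: str   # what the model should do
    target_image: str  # the expected result of following the instruction

# An image-editing task as a triplet.
edit_example = EditTriplet(
    input_image="dog_on_beach.png",
    instruction="replace the beach with a snowy field",
    target_image="dog_on_snow.png",
)

# A vision task (segmentation) formulated as an instruction,
# so it can be trained in the same format.
vision_example = EditTriplet(
    input_image="dog_on_beach.png",
    instruction="segment the dog",
    target_image="dog_mask.png",
)

print(edit_example.instruction, "|", vision_example.instruction)
```

Training on both kinds of triplets is what gives the model the localization signal to touch only the pixels the instruction refers to.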

According to Meta, these advances could enable creative applications like generating custom animated stickers and effortlessly editing personal photos. While not aiming to replace professional creatives, Meta believes these advancements could one day lead to more accessible creative tools, easier self-expression, and new means of visual communication for everyday users.
