Google’s VideoPoet: A Multimodal Model Generating Video and Audio


Google researchers introduced VideoPoet, a language model that can process multimodal inputs, including text, images, video, and audio, to produce videos. VideoPoet employs a decoder-only transformer architecture that operates in a zero-shot manner, enabling it to generate content for tasks it hasn’t been specifically trained on. The training process consists of two steps, mirroring the approach of large language models (LLMs): pretraining and task-specific adaptation. The pretrained LLM serves as a versatile foundation that can be fine-tuned for various video generation tasks, the researchers explained.

In contrast to competing video models built on diffusion, which add noise to training data and then learn to reconstruct it, VideoPoet consolidates many video generation capabilities into a single large language model rather than relying on separately trained components for each task.

Its capabilities span text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio generation. As an autoregressive model, VideoPoet generates output by conditioning on its previously generated content. It is trained on video, text, image, and audio data, using tokenizers to convert each modality into a shared token representation.
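The autoregressive idea described above can be sketched in a few lines. This is a purely illustrative toy, not VideoPoet’s actual implementation: the vocabulary ranges and the `next_token` rule are hypothetical stand-ins for the real tokenizers and transformer.

```python
# Toy sketch of a unified token stream: text, video, and audio tokens share one
# vocabulary, and generation is autoregressive (each token conditions on all
# previous ones). All ranges and functions here are hypothetical.

TEXT_RANGE = range(0, 100)     # assumed: text token ids
VIDEO_RANGE = range(100, 200)  # assumed: video token ids
AUDIO_RANGE = range(200, 300)  # assumed: audio token ids

def next_token(context):
    """Stand-in for the transformer: a deterministic toy rule that emits a
    video token based on the running context."""
    return VIDEO_RANGE.start + (sum(context) + len(context)) % len(VIDEO_RANGE)

def generate(prompt_tokens, n_new):
    """Autoregressive decoding: each new token is conditioned on everything
    generated so far, which is what makes the model 'reference its
    previously generated content'."""
    seq = list(prompt_tokens)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq

# A "text prompt" (text tokens) decoded into a run of video tokens.
out = generate([1, 2, 3], 5)
```

Because every modality lives in one token vocabulary, the same decoding loop serves text-to-video, image-to-video, and video-to-audio; only the prompt tokens change.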

“Our results suggest the promising potential of LLMs in the field of video generation,” the researchers said. “For future directions, our framework should be able to support ‘any-to-any’ generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.”

Text to video
Text prompt: Two pandas playing cards

Image to video with text prompts
Text prompt accompanying the images (from left):

  1. A ship navigating the rough seas, thunderstorms and lightning, animated oil on canvas
  2. Flying through a nebula with many twinkling stars
  3. A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day

Image (left) and video generated (immediate right)

Zero-shot video stylization
VideoPoet can modify a pre-existing video based on text prompts.

In the provided examples, the original video is on the left, while the stylized version is immediately adjacent to it. From left to right: A wombat wearing sunglasses and holding a beach ball on a sunny beach; teddy bears gracefully ice skating on a crystal clear frozen lake; a metal lion roaring in the radiant light of a forge.

Video to audio
The researchers first generated 2-second video clips, and VideoPoet predicted the corresponding audio for each clip without relying on any text prompts.

Moreover, VideoPoet can craft a brief film by assembling multiple short clips. The researchers began by asking Bard, Google’s alternative to ChatGPT, to draft a short screenplay with prompts. They then generated video content from these prompts and combined the clips to produce the final short film.

Longer videos, editing and camera motion
Google stated that VideoPoet addresses the challenge of generating longer videos by conditioning on the last second of a video to predict the subsequent second. They explained, “By chaining this process repeatedly, we demonstrate that the model not only effectively extends the video but also maintains the visual fidelity of all objects consistently across multiple iterations.”
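The chaining described in that quote can be sketched as a simple loop. This is an illustrative toy under stated assumptions: `predict_next_second` stands in for the model, and the frame rate is an arbitrary choice, not VideoPoet’s actual value.

```python
# Hedged sketch of chained video extension: repeatedly condition on the final
# second of the clip to predict the next second. The "model" here is a toy
# function that just continues frame indices; FPS is an assumed value.

FPS = 8  # assumed frames per second for this sketch

def predict_next_second(last_second_frames):
    """Hypothetical stand-in for the model: given the last second of frames,
    'predict' one more second by continuing the frame indices."""
    start = last_second_frames[-1] + 1
    return list(range(start, start + FPS))

def extend_video(frames, extra_seconds):
    """Chain the prediction repeatedly; each step sees only the most
    recent second of the growing clip."""
    for _ in range(extra_seconds):
        frames = frames + predict_next_second(frames[-FPS:])
    return frames

clip = list(range(2 * FPS))     # a 2-second clip (16 frames)
longer = extend_video(clip, 3)  # extended to 5 seconds (40 frames)
```

The key point the sketch captures is that each step conditions only on the most recent second, so the clip can be extended indefinitely without the context growing.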

Additionally, VideoPoet can manipulate the movement of objects in existing videos. For instance, a video of the Mona Lisa can be prompted to show her yawning. Text prompts can also be used to change camera angles in pre-existing images.

To illustrate, the initial image was generated with the following prompt: “Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river.”

Subsequently, additional prompts were applied in sequence from left to right: “Zoom out,” “Dolly zoom,” “Pan left,” “Arc shot,” “Crane shot,” and “FPV drone shot.”

Prarthana Mary
