Artificial Intelligence

Google’s VideoPoet Multimodal Model Creates Both Video and Audio

Google researchers have introduced VideoPoet, a large language model that accepts multimodal inputs, including text, images, video, and audio, and generates video. VideoPoet uses a decoder-only transformer architecture and can work zero-shot, generating content for tasks it was not specifically trained on. As with other large language models (LLMs), training has two stages: pretraining and task-specific adaptation. According to the researchers, the pretrained LLM serves as a versatile foundation that can be adapted to a range of video generation tasks.

Most competing video generators rely on diffusion models, which add noise to training data and learn to reconstruct it. VideoPoet instead folds its video generation capabilities into a single LLM, rather than relying on separately trained components for each task.

Its capabilities include text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio generation. As an autoregressive model, VideoPoet generates output step by step, conditioning each step on what it has already generated. It is trained on video, text, image, and audio data, and uses tokenizers to convert each modality to and from tokens the model can process.
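
To make "autoregressive" concrete, here is a minimal Python sketch of the decoding loop. It is an illustration only, not VideoPoet's code: `toy_next_token` is a hypothetical stand-in for the trained transformer, and plain integers stand in for the tokens a real tokenizer would produce.

```python
from typing import Callable, List


def generate(prefix: List[int],
             next_token: Callable[[List[int]], int],
             n_new: int) -> List[int]:
    """Autoregressive decoding: every new token is predicted from all tokens so far."""
    tokens = list(prefix)  # copy so the caller's prompt is not modified
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens


def toy_next_token(context: List[int]) -> int:
    # Hypothetical stand-in for a trained model: a deterministic
    # function of the full context, returning a token id in 0..9.
    return (sum(context) + len(context)) % 10


prompt_tokens = [3, 1, 4]  # assumed to come from a tokenizer
output = generate(prompt_tokens, toy_next_token, n_new=5)
print(output)
```

The prompt tokens are kept at the start of the output, and each later token depends on everything before it, which is the property the article describes.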

“Our results suggest the promising potential of LLMs in the field of video generation,” the researchers said. “For future directions, our framework should be able to support ‘any-to-any’ generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.”

Text to video
Text prompt: Two pandas playing cards

Image to video with text prompts
Text prompt accompanying the images (from left):

  1. A ship navigating the rough seas, thunderstorms and lightning, animated oil on canvas
  2. Flying through a nebula with many twinkling stars
  3. A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day

Image (left) and video generated (immediate right)

Zero-shot video stylization
VideoPoet can modify a pre-existing video based on text prompts.

In the provided examples, the original video is on the left, while the stylized version is immediately adjacent to it. From left to right: A wombat wearing sunglasses and holding a beach ball on a sunny beach; teddy bears gracefully ice skating on a crystal clear frozen lake; a metal lion roaring in the radiant light of a forge.

Video to audio
The researchers first generated 2-second video clips; VideoPoet then predicted matching audio for them without any text prompt.

VideoPoet can also be used to produce a short film by stitching together multiple short clips. The researchers asked Bard, Google’s alternative to ChatGPT, to draft a short screenplay as a series of prompts. They then generated a video clip for each prompt and combined the results into the final short film.

Longer videos, editing and camera motion
Google said VideoPoet handles longer videos by conditioning on the last second of a video to predict the next second. They explained, “By chaining this process repeatedly, we demonstrate that the model not only effectively extends the video but also maintains the visual fidelity of all objects consistently across multiple iterations.”
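
The chaining idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Google's implementation: `toy_predictor` is a hypothetical stand-in for the model, frames are represented by integers, and the 8 frames-per-second rate is an assumed value chosen for the example.

```python
from typing import Callable, List

Frame = int  # placeholder; a real frame would be an image or a run of tokens


def extend_video(frames: List[Frame],
                 predict_next_second: Callable[[List[Frame]], List[Frame]],
                 fps: int,
                 extra_seconds: int) -> List[Frame]:
    """Extend a clip by repeatedly predicting one more second from the last second."""
    if len(frames) < fps:
        raise ValueError("need at least one second of video to condition on")
    video = list(frames)
    for _ in range(extra_seconds):
        context = video[-fps:]  # condition only on the most recent second
        video.extend(predict_next_second(context))
    return video


def toy_predictor(context: List[Frame]) -> List[Frame]:
    # Hypothetical stand-in for the model: keep counting up from the last frame.
    last = context[-1]
    return [last + i + 1 for i in range(len(context))]


clip = list(range(8))  # one second of video at the assumed 8 fps
longer = extend_video(clip, toy_predictor, fps=8, extra_seconds=3)
print(len(longer))
```

Because each step sees only the final second of what exists so far, the loop can run for as many iterations as needed; whether quality holds up over many iterations depends on the real model, which this sketch does not capture.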

VideoPoet can also change how objects move in existing videos. For instance, a video of the Mona Lisa can be prompted to show her yawning. Text prompts can likewise be used to apply camera motion to existing images.

To illustrate, the initial image was generated with the following prompt: “Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river.”

Subsequently, additional prompts were applied in sequence from left to right: “Zoom out,” “Dolly zoom,” “Pan left,” “Arc shot,” “Crane shot,” and “FPV drone shot.”

Apps

Astra Tech Unveils Botim AI

Astra Tech announced the integration of Botim AI into its Botim Ultra App, bringing advanced AI capabilities to the platform and enhancing accessibility for over 150 million global users. This makes Botim the first fintech in the region to introduce this innovation. Botim AI is a chat assistant designed to elevate user engagement and interaction, offering free, seamless access to cutting-edge features directly within the Ultra App. Users will benefit from intelligent, AI-driven conversations and assistance across various areas, including productivity, education, research, and everyday problem-solving.

H.E. Dr. Tariq Bin Hendi, Board Member and CEO of Astra Tech, commented: “The launch of Botim AI marks a significant milestone in our journey to revolutionize digital communication in the MENA region. By integrating advanced AI capabilities, we are enhancing user experiences and setting new standards for intelligent, seamless interactions. This innovation underscores our commitment to leveraging cutting-edge technology to meet the evolving needs of our users, while advancing our mission to deliver more inclusive solutions that empower individuals from all demographics and enable frictionless engagement with our solutions.”

The AI currently supports chat-based interactions, with enhancements such as web search capabilities and action-based integration planned for future phases. As part of its strategic evolution, Botim is advancing toward the deployment of executional AI, enabling users to complete tasks in their native language with ease. This innovation meets the growing demand for inclusive AI solutions that ensure seamless, accessible interactions, empowering individuals from all backgrounds to engage effortlessly with the Ultra App.

Botim AI represents a significant step forward in the evolution of communication apps in the MENA region, offering users a smarter, more interactive way to connect and engage. It is accessible from the landing page, Explore, and Search sections of the app; current users need only update to the latest version to get the new features. Users can also securely save chat history for future reference and delete past conversations for privacy and control.

Artificial Intelligence

Microsoft Intros New Surface Copilot+ PCs for Business

Microsoft has launched its latest Surface for Business Copilot+ PCs in the Middle East, featuring the powerful Intel Core Ultra (Series 2) processors. This new lineup, including the Surface Laptop, Surface Pro, and the upcoming Surface Laptop with 5G, is designed to boost productivity and AI innovation for professionals in the region.

Engineered for exceptional performance, enhanced security, and seamless connectivity, these PCs offer unparalleled speed and efficiency for demanding tasks like multitasking, data analysis, and creative work. The sleek Surface Laptop provides a reliable and stylish solution with a high-resolution display and long battery life. The versatile Surface Pro offers tablet flexibility with laptop power, ideal for professionals on the go. For constant connectivity, the Surface Laptop with 5G will provide lightning-fast internet speeds, ensuring productivity anywhere.

“We are excited to introduce the new lineup of our Surface for Business Copilot+ PCs, which are designed to empower professionals across the Middle East to thrive in today’s highly dynamic business landscape,” said Zubin Chagpar, Senior Director and Business Group Leader for Modern Work & Surface Devices at Microsoft CEMA. “In today’s fast-evolving, AI-powered work environment, businesses seek tools that not only boost productivity but also spark creativity and innovation. With cutting-edge performance, enhanced AI capabilities, and seamless connectivity, this new lineup empowers professionals to excel in their work, no matter where they are. At Microsoft, we remain deeply committed to empowering every professional across the workforce to achieve more, and the Surface for Business Copilot+ PCs are a testament to this mission.”

These Copilot+ PCs are Microsoft’s most secure Windows devices yet, featuring the Microsoft Pluton security processor enabled by default. This chip-to-cloud security technology, embedded in the CPU by silicon partners, provides a secure vault for sensitive data like passwords and encryption keys, protecting against physical and remote attacks.

Alongside the new PCs, Microsoft is launching several innovations. The new Surface USB4 Dock offers enhanced connectivity with faster data transfer and multi-display support. Microsoft Teams Rooms on Surface Hub 3 are improved for seamless meeting collaboration. Finally, the Security Copilot in the Surface Management Portal provides advanced security features and insights for efficient Surface device management and protection by IT administrators.

The new Surface for Business Copilot+ PCs are now available to order. The Microsoft Surface Laptop for Business with Intel Core Ultra processors (Series 2) has been available for delivery in the Middle East market since March 4, 2025, and the Microsoft Surface Pro for Business with Intel Core Ultra processors (Series 2) can be pre-ordered now for delivery starting March 24, 2025.

Apps

Google’s Latest AI Model Enables Watermark Removal from Images

A potentially controversial application of Google’s new Gemini 2.0 Flash AI model has emerged: users are leveraging it to remove watermarks from images, including those from stock photo sites such as Getty Images.

The recently expanded image generation feature of Gemini 2.0 Flash allows for native image generation and editing, a powerful tool that seemingly lacks robust usage restrictions. Social media users have highlighted how the AI can not only remove watermarks but also intelligently fill in the resulting gaps, often with impressive accuracy, and it’s currently free within Google’s AI Studio developer tools.

While labeled “experimental” and “not for production use,” Gemini 2.0 Flash’s ability to bypass watermarks stands in contrast to models like Anthropic’s Claude 3.7 Sonnet and OpenAI’s GPT-4o, which explicitly refuse such requests, citing ethical and legal concerns.

It’s important to note that Gemini 2.0 Flash isn’t foolproof; it can struggle with semi-transparent or heavily overlaid watermarks. Nevertheless, the ease with which it can remove watermarks raises potential copyright issues, as removing a watermark without the copyright holder’s permission is generally illegal in many countries. This situation underscores the ongoing challenges of balancing powerful AI capabilities with copyright protection.
