OpenAI introduced their new text-to-video model Sora. In this guide, we'll look at what it is, how it was built, and more.

By Peter Foy
There are moments in AI where you are utterly amazed by how fast the industry is moving, and today is one of them. OpenAI just introduced their text-to-video model, Sora, and the results are mind-blowing.

While the general public still doesn't have access to Sora at the time of writing, Sam Altman has been sharing unedited videos on X, and it's already hard to imagine where we'll be in 12 months...

Sora vs. Will Smith eating spaghetti

Just to provide some context about where text-to-video was less than a year ago, here's the classic (and somewhat disturbing) AI-generated video of Will Smith eating spaghetti...

Compare that with OpenAI's new Sora model and you can see how far the industry has come, and why this model is truly a breakthrough...

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

What is Sora?

Sora is a text-to-video model that can produce videos up to a minute long with an incredibly high level of visual quality and adherence to the user’s prompt.

Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.

Currently, the model is only available to select users and red teamers, both to prepare the film & design industry for what's coming, and also to assess the model for potential risks.

What stands out most about the Sora model is its ability to understand and simulate the physical world in motion, marking a significant step towards AI models that can interact with real-world scenarios and solve problems requiring an understanding of physical dynamics and aesthetics.

What can Sora do?

So far, the capabilities of Sora seem to be vast and varied. From creating scenes of stylish individuals navigating the bustling, neon-lit streets of Tokyo to generating footage of prehistoric wooly mammoths treading through snowy landscapes, Sora's range is impressive.

It can produce content across genres, including historical reenactments, futuristic cyberpunk narratives, and photorealistic nature documentaries. This versatility will undoubtedly make Sora a valuable tool for filmmakers, visual artists, designers, and marketers looking to bring their imaginative visions to life...

How was Sora built?

Sora is a state-of-the-art diffusion model: it starts from a video that looks like static noise and gradually removes that noise over a series of refining steps until a clear, coherent video emerges.
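To make the "noise in, video out" idea a bit more concrete, here is a minimal toy sketch of a diffusion-style refinement loop in Python with NumPy. It is not OpenAI's implementation: the noise schedule, the deterministic (DDIM-style) update, and every function name here are assumptions chosen for illustration. A real model uses a trained neural network to predict the noise; this sketch swaps in a hypothetical "oracle" that cheats by knowing the clean video, purely to show how the loop is wired.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.2):
    """Cumulative signal-retention factors (alpha_bar) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, alpha_bar, noise):
    """Forward process: blend a clean sample with Gaussian noise."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise


def predict_clean(x_t, alpha_bar, predicted_noise):
    """Invert add_noise: estimate the clean sample from a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * predicted_noise) / np.sqrt(alpha_bar)


def denoise(x_T, alpha_bars, noise_predictor):
    """Refine from the noisiest step down to a clean sample (deterministic, DDIM-style)."""
    x = x_T
    for t in range(len(alpha_bars) - 1, -1, -1):
        eps = noise_predictor(x, t)
        x0_hat = predict_clean(x, alpha_bars[t], eps)
        if t == 0:
            return x0_hat
        # Step to the next, slightly less noisy level.
        x = add_noise(x0_hat, alpha_bars[t - 1], eps)
    return x


def make_oracle(x0, alpha_bars):
    """Hypothetical stand-in for a trained network: it knows the clean video."""
    def oracle(x_t, t):
        return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])
    return oracle


# Toy "video": 4 frames of 8x8 RGB pixels.
rng = np.random.default_rng(0)
clean_video = rng.random((4, 8, 8, 3))
alpha_bars = make_schedule()
noisy_video = add_noise(clean_video, alpha_bars[-1], rng.standard_normal(clean_video.shape))
restored = denoise(noisy_video, alpha_bars, make_oracle(clean_video, alpha_bars))
print("max reconstruction error:", np.abs(restored - clean_video).max())
```

With a perfect noise predictor the loop recovers the clean video; in practice the quality of the result depends entirely on how well the trained network predicts the noise at each step.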

Here are a few of the details around the research techniques OpenAI used to build Sora:

  • It uses a transformer architecture akin to that used in GPT models, although this model treats videos and images as collections of patches, analogous to tokens in language models.
  • This patch-based representation makes it possible to train a single model on diverse visual data with different durations, resolutions, and aspect ratios.
  • By incorporating techniques such as recaptioning from DALL·E 3, Sora gains the ability to generate videos that closely follow textual instructions, showcasing its versatility in creating content from text prompts or enhancing existing images and videos.
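The "patches as tokens" idea from the list above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name `video_to_patches` and the patch sizes are hypothetical, and the sketch cuts up raw pixels for simplicity, whereas OpenAI's report describes extracting patches from a compressed representation of the video.

```python
import numpy as np


def video_to_patches(video, pt=2, ph=4, pw=4):
    """Split a video of shape (frames, height, width, channels) into flat spacetime patches.

    Each patch covers pt frames by ph x pw pixels and becomes one row (one "token").
    Patch sizes are illustrative assumptions, not values from OpenAI.
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch sizes")
    # Expose the patch grid as separate axes.
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Group grid axes first, then the contents of each patch.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch, one column per value inside the patch.
    return x.reshape(-1, pt * ph * pw * C)


# Two clips with different durations and aspect ratios.
square_clip = np.zeros((8, 16, 16, 3))
wide_clip = np.zeros((16, 16, 32, 3))

print(video_to_patches(square_clip).shape)
print(video_to_patches(wide_clip).shape)
```

Both clips end up as sequences of patches with the same width per patch; only the number of patches differs. That is why a transformer, which already handles variable-length token sequences, can be trained on videos of different lengths and shapes.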

This approach not only broadens the scope of visual content generation but also lays the groundwork for models capable of simulating real-world phenomena, marking a significant stride towards the development of Artificial General Intelligence (AGI).

As OpenAI puts it in the technical report: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world."

You can check out the full technical report of Sora here.

Sora vs. Runway

Now, the real question: how many AI startups did OpenAI just kill with Sora?

Competitors in the text-to-video space include Runway, Pika Labs, and Stable Video, among others. Arguably the most notable is Runway's Gen-2, although as the Runway example below shows, Sora is significantly more impressive:

Sora Video Examples

Alright, enough writing. Let's look at a few more examples of what Sora can do.

Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.
Prompt: An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt , he wears a brown beret and glasses and has a very professorial appearance, and the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.
Prompt: The story of a robot’s life in a cyberpunk setting.

Summary: Sora Text to Video

If you've been on Twitter in the last day, you know that Sora has taken the AI world by storm. It makes you realize that in a few short months it will be very hard to tell whether the videos in your feed are real or AI-generated.

Sora is clearly a major breakthrough in AI and offers unprecedented capabilities in text-to-video generation. As OpenAI continues to refine Sora and roll out access to more users, the possibilities for what's coming are truly boundless...

