ChatGPT Can Now Speak, Listen, and Analyze Images

This week in AI, OpenAI announced that ChatGPT is becoming multimodal: it can now speak, listen, and analyze images.

By Mike Sak

This week in AI, OpenAI announced a few major updates to ChatGPT. The world's most famous LLM is becoming multimodal: it will be able to listen to spoken prompts, answer aloud in a synthetic voice, and analyze images. It is a major update for OpenAI and the platform's largest enhancement since the company released GPT-4 earlier this year.

From OpenAI's announcement, "ChatGPT can now see, hear, and speak":

"We are beginning to roll out new voice and image capabilities in ChatGPT. They offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you're talking about."

The addition of these senses will make ChatGPT reminiscent of other AI-based assistants like Amazon's Alexa and Apple's Siri. At launch, the speech and listening capabilities will only be available in ChatGPT's mobile apps for iOS and Android. The underlying voice technology can also respond to conversations between people and narrate stories, and Spotify is even using it to translate podcasts into multiple languages.
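
OpenAI says the listening side of this is powered by Whisper, its open-source speech recognition system, which is already available to developers through the API. As a rough illustration of that transcription step (the audio file name below is just a placeholder), turning a spoken prompt into text looks something like this:

```python
# Illustration only: Whisper, which OpenAI says powers ChatGPT's new
# listening ability, is already exposed via the API. The file name
# here is a placeholder, not something from the announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("spoken_prompt.mp3", "rb") as audio_file:
    # Send the audio to the whisper-1 transcription endpoint
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # the transcribed prompt as plain text
```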

The other major upgrade is that ChatGPT will be able to identify and process images. Users will not only be able to share photos with ChatGPT, similar to Google Lens, but also act on those images. For example, a photo of the Eiffel Tower will prompt ChatGPT to describe the landmark and recount its history. But in the latest version, users can go further: share an image of a piece of software or a set of blueprints, and ChatGPT can actually produce code from it.
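
This announcement covers the consumer app, and OpenAI hasn't said whether the image capabilities will also reach its developer API. Assuming they eventually do, a request might look something like this minimal sketch using OpenAI's Python SDK; the model name, image URL, and availability are all assumptions, not part of the announcement:

```python
# Hypothetical sketch: sending an image to a multimodal model through
# OpenAI's Chat Completions API. Model name and image URL below are
# placeholders; the announcement only covers the ChatGPT app.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name
    messages=[
        {
            "role": "user",
            # Multimodal prompts mix text parts and image parts
            "content": [
                {"type": "text",
                 "text": "What landmark is this, and what is its history?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/eiffel-tower.jpg"}},
            ],
        }
    ],
    max_tokens=300,  # cap the length of the description
)

print(response.choices[0].message.content)
```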

Check out Mckay Wrigley's posts demonstrating the capabilities of ChatGPT's new image processing.

The image capabilities will be available to paid ChatGPT users and, unlike voice, can be used on both mobile and desktop.

This ChatGPT update comes at an opportune time, as anticipation builds for the rumoured multimodal release of Google's Gemini AI platform, as well as competing offerings from Meta Platforms and Microsoft.

It also comes on the heels of Amazon announcing a major investment in OpenAI rival Anthropic. The deal will see Amazon invest upwards of $4 billion in Anthropic and take a minority stake in the company, and Anthropic will use Amazon's AWS as its primary cloud provider.

On Wednesday, Meta Platforms introduced its AI-powered Ray-Ban smart glasses at its Meta Connect developer conference. The glasses will be able to analyze whatever the wearer is looking at, with examples including identifying buildings and providing instructions for fixing a leaky faucet. They also come with a 12 MP camera and can livestream everything you see directly to social media.

All of this is to say that the AI race is certainly heating up, and the technology is advancing faster than almost anyone imagined. The addition of speaking and listening could be a game changer for ChatGPT... let's see how it's received over the next few weeks.
