What you need to know
- OpenAI just released its new flagship GPT-4o model.
- Interacting with ChatGPT becomes more seamless with the model's ability to reason across audio, vision, and text in real time.
- OpenAI also announced a native ChatGPT app for Mac, leaving Windows behind.
- A viral ChatGPT demo showcased GPT-4o's audio and vision capabilities by having it hold a conversation with another AI model.
OpenAI just announced its new flagship GPT-4o model (I'm sure I'm not the only one losing track as these models keep shipping). Essentially, GPT-4o is an improved version of OpenAI's GPT-4, and it's just as smart. The new model is more intuitive and can reason across audio, vision, and text in real time, making interacting with ChatGPT more seamless.
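For the curious, GPT-4o is also exposed to developers through OpenAI's existing API under the model name "gpt-4o." Here's a minimal sketch of a text-plus-vision request using the official Python SDK; the prompt and image URL are placeholders for illustration, not anything taken from the demo:

```python
# Minimal sketch: sending text plus an image to GPT-4o via OpenAI's Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this photo."},
                # Placeholder image URL, for illustration only
                {"type": "image_url", "image_url": {"url": "https://example.com/selfie.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The real-time voice back-and-forth shown in the demos goes beyond this simple request-response pattern, but a call like this gives a feel for the model's multimodal input.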
While the “magic” behind OpenAI's just-concluded Spring Update event is still up for debate, the demos that have surfaced on social media are pretty impressive and border on amazing. Translating Italian into English and relaying the information in real time is extremely difficult, and doing it well sweeps aside communication obstacles such as language barriers.
But what baffled me was a video demo shared by OpenAI President and Co-Founder Greg Brockman on X (formerly Twitter). I never thought I'd see the day when a virtual assistant could hold a full conversation with another AI assistant with so little friction.
"GPT-4o is a new model that can reason across text, audio, and video in real time. It's extremely versatile, fun to play with, and a step toward more natural forms of human-computer interaction (and even human-computer-computer interaction): pic.twitter.com/VLG7TJ1JQx" — May 13, 2024
The demo begins with the user explaining to two AI chatbots that they will essentially be talking to each other. He sets expectations by telling them that one of them can see the world through a camera, while the other can ask it questions and direct it to perform specific tasks with the camera's help.
“Well, well, just when you thought it couldn’t get any more interesting,” the first chatbot jokingly replied. “Talking to another AI that can see the world sounds like a plot twist in the AI world.” Just before the AI assistant agreed to the terms, the user asked it to pause for a moment while he gave instructions to the second AI.
The user immediately starts a conversation by telling the second AI assistant that it has the ability to see the world, which reads as a subtle prompt for the assistant to access the phone's camera, its eyes on the world. Right off the bat, the interface switches to the camera (in selfie mode), giving the model a very clear view of what the user is wearing and his environment.
From this point on, the user points out that the first AI model will talk to it and ask it questions, such as requests to move the camera and describe what it can see, and that it should be helpful and answer those questions accurately.

The process begins with the AI that can “see the world” describing what it sees, including detailed context about the user, his outfit, and the design of the building. Interestingly, the first AI provides feedback based on the information shared, so it feels as if two humans are talking over FaceTime. What's more, the AI seems to have a good read on what the user is doing, his facial expressions, and even his style based on what he's wearing.
What blew my mind was when the user signaled another person in the room to approach and appear in the AI's view. The AI recognized this instantly and even suggested that the user “might be preparing for a presentation or conversation” based on his direct eye contact with the camera.
Interestingly, the introduction of a third party didn't affect the conversation between the two AIs. At first glance, it seemed safe to say that the AI hadn't caught a glimpse of the person walking into the room and standing behind the user holding the phone.
But that wasn't the case. The user momentarily paused the conversation between the two AIs to ask if anything unusual had happened. The vision-enabled AI noted that a second person had appeared behind the user, playfully made bunny ears behind him, and quickly left the frame. The AI described the situation as hilarious and unexpected.
The demo continues to showcase GPT-4o's vast capabilities. The user even asks both models to create a song based on what just happened and to take turns singing its lines; at one point, he looks like a choir director preparing a choir for an important upcoming event at church.
I should also point out that most of the demos I've seen so far have been running on Apple devices like iPhones and MacBooks. Perhaps this is why OpenAI released a native ChatGPT app for Mac before shipping one for Windows. Additionally, OpenAI CEO Sam Altman acknowledged that “the iPhone is the best technology ever created by humans.”
