OpenAI Targets a Bigger Entry Point than ChatGPT
On May 7, 2026, OpenAI officially launched GPT-Realtime-2, a voice model that promises to change the rules of human-machine interaction. In its fastest mode the model starts speaking in 1.12 seconds, and its end-to-end architecture retains subtle details from human conversation. The competition over voice interaction is reshaping service paradigms in customer service, healthcare, finance, and more.
Last week, I called an online education institution for course inquiries. The voice on the other end was female, speaking at a moderate pace with pauses and occasional fillers, reminiscent of a real customer service representative. She asked about my child’s grade and availability, saying, “Please share the details, and I will help you with recommendations.”
The conversation lasted about five minutes, and it flowed smoothly. After hanging up, I reflected on some details: the representative remembered that I mentioned my child’s struggles in math two minutes into the call and circled back to suggest a direction based on that information four minutes later.
I was surprised to discover that this was a customer service system built on a large voice model. A human agent would rarely recall details so accurately or connect points across a five-minute conversation.
OpenAI’s release of GPT-Realtime-2 has sparked discussions about its capabilities in reasoning and multilingual translation. However, I argue that what OpenAI is truly after is a much larger entry point than ChatGPT.
What OpenAI Released
On May 7, 2026, OpenAI unveiled three models and transitioned the long-anticipated Realtime API from beta to a full release. The models have distinct roles, but GPT-Realtime-2 is the star. Its main changes are:
- Context Window Expansion: The context window has increased from 32k to 128k tokens. This isn’t a trivial parameter change: 32k can handle a 10-minute call, while 128k can accommodate a 40-minute meeting. This means the model can remember details from a conversation half an hour long.
- Five Levels of Reasoning: The model offers five reasoning levels, from minimal to very high. In the lowest mode, the first word of audio is produced in 1.12 seconds. The highest level takes a bit longer at 2.33 seconds but can handle more complex multi-step reasoning.
To put this in perspective, Siri takes an average of 1.8 seconds to respond after you finish speaking. OpenAI achieves this in 1.12 seconds, providing meaningful responses rather than saying, “I didn’t understand you.”
- Significant Benchmark Improvements: In the Big Bench Audio evaluation, GPT-Realtime-2 scored 82.8%, a significant leap from the previous generation’s 65.6%. In MultiChallenge, it jumped from 34.7% to 48.5%.
This isn’t just a regular upgrade; it’s a generational leap.
- Introduction of Preamble: A crucial new feature is the preamble. When the model needs to access tools, check data, or think, it will say something like, “Let me check that,” or “Please hold on while I look at the calendar.” Previous voice assistants either answered immediately or remained silent—silence is detrimental, as users might think the system has crashed. GPT-Realtime-2 engages with users, saying, “Hmm, let me think.”
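The preamble behavior can be pictured as starting a slow tool call in the background and speaking a filler line only if the result has not arrived within a short delay. The sketch below is a minimal illustration in plain `asyncio`. It does not use the Realtime API, and `lookup_calendar`, `respond_with_preamble`, and the 0.05-second `patience` threshold are hypothetical names and values chosen for the example.

```python
import asyncio
from typing import Awaitable, Callable


async def lookup_calendar() -> str:
    """Hypothetical slow tool call; stands in for any data lookup."""
    await asyncio.sleep(0.2)
    return "You are free at 3 pm."


async def respond_with_preamble(
    tool: Callable[[], Awaitable[str]],
    preamble: str,
    patience: float = 0.05,  # assumed threshold, in seconds
) -> list[str]:
    """Return the utterances in the order they would be spoken."""
    spoken: list[str] = []
    task = asyncio.ensure_future(tool())
    # Give the tool a brief head start; if it is still running, fill the silence.
    done, _pending = await asyncio.wait({task}, timeout=patience)
    if not done:
        spoken.append(preamble)
    spoken.append(await task)
    return spoken


def demo() -> list[str]:
    return asyncio.run(respond_with_preamble(lookup_calendar, "Let me check that."))


if __name__ == "__main__":
    print(demo())
```

A tool that returns faster than `patience` produces no preamble at all, which mirrors the idea that the filler exists only to cover a wait.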
Architectural Revolution
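To understand the significance of this, we need to look at how voice AI has been constructed over the past decade. All previous voice assistants (Siri, Xiao Ai, Tmall Genie, and smart customer service systems) followed a similar three-stage pipeline: speech-to-text, a large language model working on the transcript, and text-to-speech. Each stage loses something: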
- Loss of Emotion: Speech-to-text only focuses on “what you said,” ignoring “how you said it.” For example, saying “I’m fine” could mean genuinely fine or expressing frustration, but this emotional nuance is lost in transcription.
- Loss of Context: The large model only receives dry text without understanding pauses, sighs, or tremors in voice.
- Loss of Authenticity: The synthesized voice is mechanical and flat, sounding scripted. Everyone can tell it’s a machine.
This explains why all previous AI voice assistants sounded similar—not due to a lack of technology but because of the inherent limitations in their architecture.
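A toy model of the three-stage pipeline makes the loss concrete. In the sketch below every name (`Utterance`, `speech_to_text`, `language_model`, `text_to_speech`, `cascaded_reply`) is hypothetical and the stages are stubs, not real speech components. The point is only that once the first stage returns a plain string, tone and pause information is unavailable to everything downstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    words: str
    tone: str       # e.g. "calm" or "frustrated"
    pause_ms: int   # hesitation before speaking


def speech_to_text(u: Utterance) -> str:
    # Stage 1: only the words survive; tone and pause are dropped here.
    return u.words


def language_model(text: str) -> str:
    # Stage 2: a stub that can only react to the transcript.
    return "Glad to hear it." if text == "I'm fine" else "Could you say more?"


def text_to_speech(text: str) -> Utterance:
    # Stage 3: a stub synthesizer with one flat delivery.
    return Utterance(words=text, tone="neutral", pause_ms=0)


def cascaded_reply(u: Utterance) -> Utterance:
    return text_to_speech(language_model(speech_to_text(u)))


calm = Utterance("I'm fine", tone="calm", pause_ms=0)
upset = Utterance("I'm fine", tone="frustrated", pause_ms=1200)

# Both inputs collapse to the same transcript, so the replies are identical.
print(cascaded_reply(calm) == cascaded_reply(upset))  # True
```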
GPT-Realtime-2 dismantles this architecture. It operates end-to-end—sound waves go in and come out directly, without a text translation layer in between. This means the model captures not just “what you said” but “how you said it.” It can pause to wait for you if you cough. This is why the education consultation call I experienced made me think it was a real person—not because the model was intelligent, but because the architecture had changed.
The Real Turnaround: Why OpenAI is Committed to This
If you view GPT-Realtime-2 merely as a “better voice API,” you miss the bigger picture. In the past 12 months, OpenAI’s iteration speed on text models has noticeably slowed. Following the release of GPT-5, public discussion has dramatically declined. The performance score increased by only 10% from GPT-4 to GPT-5, but users felt it was “about the same.”
So, what is OpenAI betting on now? Voice. This is not a gamble; it’s inevitable.
Consider the evolution of human-machine interaction. Every shift in interaction form has eliminated the previous entry point and redistributed internet traffic. The current AI era still revolves around text input: typing to ChatGPT, Claude, or Doubao. However, typing to AI is a transitional phase.
For over a decade, major companies have tried to make voice the primary entry point, but all have failed. Apple’s Siri has been trying for fifteen years without success, Amazon’s Alexa burned through $25 billion, and Xiao Ai, Baidu, and Tmall Genie have devolved into mere “voice remote controls” that can only turn lights on and off.
Why did they fail? Because previous voice assistants were based on “commands”—you had to speak clearly and correctly for them to understand. GPT-Realtime-2 is based on “dialogue”—you can start a sentence, pause to think, and it will wait for you; you can change your mind mid-sentence, and it will keep up.
What Apple couldn’t achieve in fifteen years, OpenAI accomplished with a single API.
I predict that within 18 months, a product equivalent to a “voice version of ChatGPT” will emerge. It may not necessarily be developed by OpenAI but could come from a startup using the Realtime API. Its form will be simple—press a button, speak, and release to receive a response. Whoever creates this first will secure the next generation of entry.
Industry Transformation
Now, let’s look at the changes already happening on the enterprise side.
Real estate platform Zillow saw user connection success rates jump from 69% to 95% after integrating a large voice model, a 26 percentage point increase. Previously, with traditional voice menus, 3 out of 10 callers would hang up before completing their intent; now, 9.5 out of 10 can finish their requests. This 26-point increase translates directly into sales, commissions, and revenue.
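A change in percentage points and a relative percentage change are different quantities, so the arithmetic behind the Zillow figures is worth spelling out. Using only the two rates quoted above, the sketch below separates the two; the function names are illustrative.

```python
def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting rate, in percent."""
    return (after_pct - before_pct) / before_pct * 100


def failures_per_ten(success_pct: float) -> float:
    """Callers out of every 10 who do not complete their request."""
    return (100 - success_pct) / 10


print(point_gain(69, 95))                           # 26
print(round(relative_gain(69, 95), 1))              # 37.7
print(failures_per_ten(69), failures_per_ten(95))   # 3.1 0.5
```

So the success rate rose 26 percentage points, which is roughly a 38% relative improvement, and the share of callers who gave up fell from about 3 in 10 to 1 in 20.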
In the Chinese market, these numbers are similarly significant, if not more dramatic. Think about your last experience with a bank’s customer service: 30 seconds of hold music, a four-layer voice menu, then a transfer to a human agent who asks for your ID number again. The entire process takes eight minutes to resolve a password change request.
With a large voice model, the interaction could be streamlined to: “I want to change my password,” followed by, “Please tell me your name and registered phone number,” completing the task in 30 seconds.
This isn’t a future scenario; several banks and insurance companies in China are already conducting proofs of concept in this direction. The only open question is when a sufficiently robust foundation model like GPT-Realtime-2 will be deployed in local scenarios.
All tasks that can be resolved with a single phone call are being rewritten. Customer service calls, sales outreach, front desk appointments, online education consultations, hospital appointment scheduling, and government hotline inquiries: any scenario that previously required a human to answer the phone can now be handled by a voice agent that is patient and available 24/7 at a very low cost.
More importantly, for all the small companies that want to serve customers but cannot afford a customer service team, this capability is now within reach. A small clinic can have its own 24-hour appointment assistant, a law firm can have its own initial consultation front desk, and a local business can have its own call center.
In the past decade, cloud computing made servers affordable for small companies; in the past five years, SaaS made sales tools accessible; and in the next five years, voice models will enable small companies to afford “24/7 receptionists.” This is an unlock at the infrastructure level.
Conclusion
Returning to the education consultation call, the system still isn’t perfect and occasionally has slightly unnatural pauses. However, it was able to keep me uncertain until the end of the call about whether it was a machine.
After the release of GPT-Realtime-2, these “unnatural pauses” will disappear.
This shift is not just a technical detail; it reshapes the industry at the infrastructure level. The era of typing to AI has just begun, and the era of speaking to AI is set to launch from this point in May 2026.
Every entry point switch redistributes power and gives rise to new companies. The PC era empowered Google and Amazon; the mobile era favored Apple, ByteDance, and WeChat; the AI typing era is currently benefiting OpenAI, Anthropic, and ByteDance’s Doubao.
Who will benefit from the AI speaking era? It could be a small team just starting to develop a medical voice agent, a product that goes deep into the banking customer service chain, or new infrastructure that provides tooling and platforms for voice agents.
One thing is certain: this door has just opened.