Introduction to OpenAI FM

Audio and speech

Explore audio and speech features in the OpenAI API.

The OpenAI API provides a range of audio capabilities. If you know what you want to build, find your use case below to get started. If you're not sure where to start, read this page as an overview.

A tour of audio use cases

LLMs can process audio by using sound as input, creating sound as output, or both. OpenAI has several API endpoints that help you build audio applications or voice agents.

Voice agents

Voice agents understand audio, handle tasks, and respond in natural language. There are two main ways to build one: with speech-to-speech models and the Realtime API, or by chaining together a speech-to-text model, a text language model that processes the request, and a text-to-speech model that speaks the response. Speech-to-speech has lower latency and sounds more natural, but chaining is a reliable way to extend an existing text-based agent into a voice agent.
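The chained approach can be sketched as three stages called in order. This is a minimal illustration, not an official implementation: the function name run_chained_turn and its three callable parameters are hypothetical placeholders for whichever speech-to-text, text, and text-to-speech calls you actually use.

```python
from typing import Callable


def run_chained_turn(
    audio_in: bytes,
    speech_to_text: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Run one turn of a chained voice agent: audio -> text -> reply -> audio.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    # Stage 1: transcribe the user's audio into text.
    transcript = speech_to_text(audio_in)
    # Stage 2: let a text language model process the request.
    reply_text = generate_reply(transcript)
    # Stage 3: turn the text reply back into audio.
    return text_to_speech(reply_text)
```

Because each stage is an ordinary function, an existing text-based agent can be reused unchanged as generate_reply, which is the main appeal of chaining over speech-to-speech.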

Streaming audio

Process audio in real time to build voice agents and other low-latency applications, including transcription use cases. You can stream audio in and out of a model with the Realtime API. Our advanced speech models provide automatic speech recognition for improved accuracy, low-latency interactions, and multilingual support.
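As a rough sketch of what streaming audio in can look like, the helper below splits raw audio bytes into chunks and wraps each one in a JSON event. It assumes a Realtime API WebSocket session in which the client sends base64-encoded audio in input_audio_buffer.append events; the function name, the chunk size, and the audio format are illustrative assumptions, and opening the connection is left out.

```python
import base64
import json
from typing import Iterator


def audio_append_events(pcm_audio: bytes, chunk_size: int = 4800) -> Iterator[str]:
    """Yield one JSON-encoded append event per chunk of raw audio.

    chunk_size is an arbitrary illustrative value, not a documented requirement.
    """
    for start in range(0, len(pcm_audio), chunk_size):
        chunk = pcm_audio[start:start + chunk_size]
        yield json.dumps({
            # Assumed event type for appending audio to the input buffer.
            "type": "input_audio_buffer.append",
            # Audio bytes travel as base64 text inside the JSON message.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string would be sent as one WebSocket message; smaller chunks mean more messages but less delay before the model receives audio.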

Text to speech

For turning text into speech, use the Audio API audio/speech endpoint. Models compatible with this endpoint are gpt-4o-mini-tts, tts-1, and tts-1-hd. With gpt-4o-mini-tts, you can ask the model to speak a certain way or with a certain tone of voice.
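A minimal sketch of a text-to-speech request using only the Python standard library follows. It assumes the endpoint lives at https://api.openai.com/v1/audio/speech, that the JSON body takes model, voice, input, and an optional instructions field for tone, and that the API key is in the OPENAI_API_KEY environment variable; the helper names and the default voice are illustrative choices.

```python
import json
import os
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"  # assumed endpoint URL


def build_speech_request(text, api_key, model="gpt-4o-mini-tts",
                         voice="alloy", instructions=None):
    """Build (but do not send) a POST request for the audio/speech endpoint."""
    payload = {"model": model, "voice": voice, "input": text}
    if instructions is not None:
        # Tone or style guidance, e.g. "Speak calmly."
        payload["instructions"] = instructions
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, os.environ["OPENAI_API_KEY"], **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out_file:
        out_file.write(response.read())
```

For example, synthesize_to_file("Welcome back!", "welcome.mp3", instructions="Sound cheerful.") would save the spoken audio, assuming a valid key and network access.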

Speech to text

For speech to text, use the Audio API audio/transcriptions endpoint. Models compatible with this endpoint are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1. With streaming, you can continuously pass in audio and get a continuous stream of text back.
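A non-streaming transcription call might look like the sketch below. It assumes the official openai Python package, whose client exposes audio.transcriptions.create and returns an object with a text attribute; the client is passed in as a parameter so the helper, whose name is illustrative, does not depend on how the client is configured.

```python
def transcribe_file(client, path, model="gpt-4o-mini-transcribe"):
    """Upload one audio file and return the transcribed text.

    client is assumed to be an openai.OpenAI() instance or a compatible object.
    """
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

With the openai package installed and an API key configured, usage would be along the lines of creating the client with OpenAI() and calling transcribe_file(client, "meeting.wav"); the file name here is only an example.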

Summary and review

The OpenAI API offers audio and speech capabilities that developers can integrate into their applications. Its models can take audio as input, produce audio as output, or both, which enables voice agents, real-time streaming, text to speech, and speech to text.

Voice agents can be built with speech-to-speech models or by chaining a speech-to-text model, a text language model, and a text-to-speech model. The Realtime API streams audio in and out of a model for low-latency interactions. For speech synthesis, the audio/speech endpoint works with gpt-4o-mini-tts, tts-1, and tts-1-hd; for transcription, the audio/transcriptions endpoint works with gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1.