
You Can Now "Prompt" Your Speech-to-Text AI Like ChatGPT — Here's What Changes

AssemblyAI's new Universal-3 Pro model lets you guide transcriptions using plain English instructions. Discover how prompt-based control fixes misspelled names, redacts PII, and tags audio events instantly.

FreeVoice Reader Team
#Speech-to-Text #Voice AI #AssemblyAI

TL;DR: AssemblyAI has released Universal-3 Pro, a new speech-to-text model that lets you use plain English prompts to control how it transcribes audio. Instead of manually fixing misspelled names or building complex post-processing pipelines, users can now instruct the AI to catch specific jargon, redact sensitive info, and tag sounds like laughter or hold music in real time.

If you use voice-to-text tools daily—whether for drafting emails, transcribing meetings, or generating clinical notes—you know the familiar sting of the "cleanup phase." You dictate a flawless paragraph, only for the AI to completely butcher your client's name, misspell a niche industry term, or fail to capture the nuance of a stutter.

Until now, fixing these issues required developers to either rely on rigid "word boosting" lists or spend thousands of dollars fine-tuning custom AI models. But the landscape of voice AI is shifting.

According to a recent report from TipRanks, AssemblyAI has introduced a paradigm shift in speech-to-text technology with Universal-3 Pro. By treating audio models more like Large Language Models (LLMs), they are giving users unprecedented, prompt-based control over how their audio is transcribed. Here is what this means for your daily voice workflows.

The Shift to "SpeechLLMs"

Historically, speech-to-text (STT) models were incredibly rigid. They were trained to listen to sounds and output the most statistically likely word. If they didn't know a word, they guessed.

Universal-3 Pro is built on a different architecture known as a Speech-augmented Large Language Model (SpeechLLM). It combines a high-fidelity audio encoder with an LLM decoder. This means the AI doesn't just hear phonetic sounds; it understands context and can follow instructions.

Instead of just handing the AI an audio file and hoping for the best, you can now tell the AI how to listen.

What You Can Do Now That You Couldn't Before

For daily users of voice AI, this update effectively eliminates the most frustrating bottlenecks in transcription. The update introduces granular control through two main features: the prompt and keyterms_prompt parameters.

1. Fix Names and Jargon Instantly

The new model supports up to 1,000 specific key terms. If you are a lawyer dealing with a specific case, or a doctor dictating notes with complex pharmaceutical names, you can feed those terms to the AI before you start speaking. AssemblyAI reports up to a 45% accuracy improvement on domain-specific terms using this feature, meaning you spend drastically less time manually correcting transcripts.
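As a rough sketch, here is how a request carrying those key terms might be assembled. The `keyterms_prompt` parameter name comes from the announcement, but the payload shape and field names here are assumptions, not verified API details — check AssemblyAI's official docs for the real request format:

```python
# Sketch: building a transcription request with domain key terms.
# The `keyterms_prompt` name comes from the announcement; the other
# fields and the overall payload shape are assumptions for illustration.

KEYTERM_LIMIT = 1000  # Universal-3 Pro reportedly accepts up to 1,000 terms


def build_keyterms_request(audio_url: str, keyterms: list[str]) -> dict:
    """Assemble a request payload, enforcing the reported 1,000-term cap."""
    if len(keyterms) > KEYTERM_LIMIT:
        raise ValueError(f"Too many key terms: {len(keyterms)} > {KEYTERM_LIMIT}")
    return {
        "audio_url": audio_url,
        "keyterms_prompt": keyterms,
    }


payload = build_keyterms_request(
    "https://example.com/deposition.mp3",
    ["voir dire", "Daubert motion", "res ipsa loquitur"],
)
```

The point is that the vocabulary travels with the request, so every dictation session starts already knowing your case names or drug names.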

2. Choose Between Verbatim and Clean Output

Previously, you had to choose an AI model based on whether you wanted a readable summary or a legal-grade verbatim transcript. Now, you can provide up to 1,500 words of plain-language instructions.

You can literally prompt the AI: "Transcribe this medical appointment verbatim, including all 'ums' and 'uhs,' and label the speakers as 'Doctor' and 'Patient'."
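In request terms, that instruction would ride along in the `prompt` parameter. As above, the parameter name and the 1,500-word cap come from the announcement, while the payload shape is our assumption:

```python
# Sketch: attaching a plain-language instruction via the `prompt` parameter
# (name and word cap from the announcement; payload shape is an assumption).

PROMPT_WORD_LIMIT = 1500  # reported cap on instruction length


def build_prompt_request(audio_url: str, instruction: str) -> dict:
    """Attach a natural-language instruction, enforcing the reported word cap."""
    n_words = len(instruction.split())
    if n_words > PROMPT_WORD_LIMIT:
        raise ValueError(f"Prompt too long: {n_words} words > {PROMPT_WORD_LIMIT}")
    return {"audio_url": audio_url, "prompt": instruction}


payload = build_prompt_request(
    "https://example.com/visit.mp3",
    "Transcribe this medical appointment verbatim, including all 'ums' and "
    "'uhs', and label the speakers as 'Doctor' and 'Patient'.",
)
```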

3. Redact Sensitive Information on the Fly

For users in finance, HR, or healthcare, compliance is a massive headache. Universal-3 Pro allows you to explicitly prompt the AI to redact Personally Identifiable Information (PII) like social security numbers or credit card details directly during the transcription process.
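For audio you would rather not send anywhere at all, the crude alternative is a post-hoc pattern sweep over the finished transcript text. To be clear, this is our own illustration, not an AssemblyAI feature, and a handful of regexes catches far less than a model that redacts during transcription:

```python
import re

# Illustrative local redaction pass over finished transcript text.
# This is NOT how Universal-3 Pro works (it redacts during transcription);
# it's a regex fallback sketch for text that must stay on-device.

PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def redact(text: str) -> str:
    """Replace matched PII patterns with labeled placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


print(redact("My SSN is 123-45-6789."))
# prints: My SSN is [SSN REDACTED].
```

The gap between the two approaches is exactly why prompt-based redaction matters: a model hears "my social is one two three..." spoken aloud, while a regex only sees whatever digits happened to land in the text.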

4. Tag Audio Events and Non-Speech Sounds

The model has been trained to recognize 50+ audio event tags. It doesn't just transcribe words; it labels the environment. It can identify [laughter], [music], [beep], [silence], and even [hold music]. For customer service analytics or AI voice agents, knowing when a customer laughed or when there was an awkward silence is just as important as what was said.
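If those tags arrive inline in the transcript text, pulling analytics out of them is trivial. A minimal sketch, assuming (and it is only an assumption) that events appear as literal bracketed tokens like `[laughter]` rather than as structured JSON:

```python
import re
from collections import Counter

# Sketch: tallying bracketed audio-event tags in a transcript string.
# Assumes tags appear inline as literal tokens like "[laughter]";
# the real response format may differ (e.g. a structured events list).

TAG_RE = re.compile(r"\[([a-z ]+)\]")


def count_events(transcript: str) -> Counter:
    """Return a count of each bracketed event tag found in the text."""
    return Counter(TAG_RE.findall(transcript))


events = count_events(
    "Agent: How can I help? [hold music] Customer: Finally! [laughter] [laughter]"
)
print(events["laughter"])  # prints: 2
```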

Implications Across Your Devices

While AssemblyAI is a cloud-based API geared toward developers, the ripple effects of Universal-3 Pro will be felt across the apps you use on Mac, iOS, Android, and the web.

  • Mac and iOS: AssemblyAI maintains a robust Swift SDK. This means developers building native macOS and iOS apps for journalists, researchers, and lawyers can easily integrate these features. Expect to see third-party dictation apps on your iPhone offering "Custom Vocabulary" profiles that sync seamlessly, without the heavy local processing that drains your battery.
  • Android and Web: Voice agents and floating voice overlays will become dramatically more conversational. By understanding cues like [silence] or [throat clearing], AI agents can better manage "turn-taking" in a conversation, preventing the AI from interrupting you while you're thinking.
  • Native Code-Switching: If you speak in "Spanglish" or frequently switch between languages, Universal-3 Pro natively supports six languages (English, Spanish, French, German, Italian, and Portuguese) in a single stream without needing to be told to switch.

How It Compares to the Competition

The voice AI market is highly competitive, but companies are taking different approaches:

  • OpenAI Whisper: Whisper remains the industry standard for open-source STT. However, it lacks the native "instruction-following" and real-time streaming capabilities found in Universal-3 Pro.
  • Deepgram Nova-3: Deepgram focuses heavily on ultra-low latency and "speed-to-cost" efficiency. While great for real-time applications, it doesn't yet match the deep natural language prompting capabilities of AssemblyAI.
  • Google Gemini: While Gemini can process audio natively, it is primarily an LLM that can transcribe. Universal-3 Pro is a dedicated transcription model built specifically to reason about speech.

The Cost and Privacy Angle

AssemblyAI is positioning itself as a premium, highly configurable layer in the AI stack. Pricing sits around $0.21/hour for asynchronous transcription and $0.45/hour for streaming.
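At those rates, a quick back-of-the-envelope check is easy. The figures below use the article's quoted prices; actual billing granularity (per second, per request, minimums) is an assumption and may differ:

```python
# Back-of-the-envelope cost estimate at the article's quoted rates.
# Assumes simple per-hour proration; real billing granularity may differ.

ASYNC_RATE = 0.21   # USD per audio hour, asynchronous transcription
STREAM_RATE = 0.45  # USD per audio hour, streaming


def monthly_cost(hours: float, streaming: bool = False) -> float:
    """Estimate a monthly bill for the given audio volume."""
    rate = STREAM_RATE if streaming else ASYNC_RATE
    return round(hours * rate, 2)


print(monthly_cost(100))                  # prints: 21.0
print(monthly_cost(100, streaming=True))  # prints: 45.0
```

So a team transcribing 100 hours of recorded calls a month is looking at tens of dollars, not thousands — the real cost question is the privacy trade-off below, not the invoice.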

However, it's vital to remember that AssemblyAI is a cloud-based API. To leverage this incredible prompt-based reasoning, your audio must be sent to their servers. For businesses, the efficiency gains and reduction in manual labor make this a no-brainer. But for individual users dictating highly sensitive journals, unreleased IP, or confidential client notes, sending audio to the cloud remains a privacy concern.

This highlights the ongoing industry divide: the sheer power and contextual reasoning of cloud-based SpeechLLMs versus the absolute security and zero-latency of local, on-device AI.

As voice technology continues to evolve, the ability to "prompt" our speech models will soon become the baseline expectation. We are rapidly moving from AI that simply hears us to AI that actually listens.


About FreeVoice Reader

FreeVoice Reader is a privacy-first voice AI suite that runs 100% locally on your device:

  • Mac App - Lightning-fast dictation, natural TTS, voice cloning, meeting transcription
  • iOS App - Custom keyboard for voice typing in any app
  • Android App - Floating voice overlay with custom commands
  • Web App - 900+ premium TTS voices in your browser

One-time purchase. No subscriptions. Your voice never leaves your device.

Try FreeVoice Reader →

Transparency Notice: This article was written by AI, reviewed by humans. We fact-check all content for accuracy and ensure it provides genuine value to our readers.

