
Your Voice Agents Just Got Eyes: What ElevenLabs' Multimodal Update Means for Developers

ElevenLabs just gave its voice agents the ability to "see" images and PDFs during real-time calls. Here's how the new multimodal support and scoped conversation analysis will change how you build and debug voice apps.

FreeVoice Reader Team
#ElevenLabs #Product Update #Voice Agents

TL;DR:

  • Multimodal Support: Voice agents can now process images and PDFs mid-conversation using the new sendMultimodalMessage function in the JS SDK.
  • Scoped Conversation Analysis: Debugging multi-agent workflows is now drastically easier, allowing you to isolate metrics for specific sub-agents rather than parsing entire call transcripts.
  • Apple Ecosystem Upgrades: The new Swift SDK v3.1.2 brings ultra-low latency LiveKit WebRTC and reactive SwiftUI integration for Mac and iOS developers.
  • Workflow Overrides: Developers can now restrict specific agents to distinct tool_ids and knowledge_base documents to prevent hallucinations.

If you build or use voice AI tools daily, you already know the frustration of a "blind" voice agent. Imagine a user trying to read a 16-character router serial number aloud to a support bot, or spelling out a complex foreign address. It's a massive friction point that text-to-speech (TTS) and speech-to-text (STT) alone simply cannot solve.

In a major update to its ElevenAgents platform, ElevenLabs has fundamentally changed this dynamic. By introducing Multimodal Support and Scoped Conversation Analysis, the company is aggressively pivoting from a specialized voice-cloning provider into a comprehensive "Agentic AI" powerhouse.

Here is exactly what this means for your daily workflows, your app development, and the future of voice interfaces.

Multimodal Support: The End of "Spelling It Out"

The most immediately impactful feature for end-users is the addition of multimodal message support. The JavaScript SDK (@elevenlabs/client) now includes a sendMultimodalMessage hook.

Instead of forcing users to choose between a text chat or a voice call, developers can now build hybrid interactions. During a live, real-time voice conversation, a user can upload a photo of a broken product, a screenshot of an error code, or a PDF of a receipt. The agent can "see" this visual data and respond verbally in real-time.

This is a massive leap for data extraction and CRM integration. By allowing users to augment their voice with visual context, businesses can drastically reduce call times and eliminate the hallucination risks associated with poor phonetic transcriptions of complex data.
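To make this concrete, here is a minimal sketch of building such a hybrid message. Note the hedge: the real `sendMultimodalMessage` signature and the exported `MultimodalMessageInput` type live in `@elevenlabs/client`, so the `MultimodalMessageInput` stand-in type and the `buildMultimodalMessage` helper below are our own illustrative assumptions, not the SDK's actual API.

```typescript
// Illustrative stand-in type -- the real MultimodalMessageInput is exported
// from @elevenlabs/client and may differ in shape.
type MultimodalMessageInput = {
  text?: string;
  attachments: { type: "image" | "pdf"; dataUrl: string }[];
};

// Hypothetical helper: pair a spoken/typed prompt with an uploaded file,
// encoding the file as a data URL for transport.
function buildMultimodalMessage(
  text: string,
  file: { mimeType: string; base64: string }
): MultimodalMessageInput {
  const kind = file.mimeType === "application/pdf" ? "pdf" : "image";
  return {
    text,
    attachments: [
      { type: kind, dataUrl: `data:${file.mimeType};base64,${file.base64}` },
    ],
  };
}

// During a live call, you would hand the payload to the SDK hook, e.g.:
//   const { sendMultimodalMessage } = useConversationControls();
//   sendMultimodalMessage(buildMultimodalMessage("What's this error?", screenshot));
```

The point of the pattern is that the voice session stays open while the attachment rides alongside the transcript, so the agent can answer verbally about the image without the user ever leaving the call.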

Scoped Conversation Analysis: Debugging the Multi-Agent Mess

As enterprises have started deploying ElevenAgents for complex tasks, they've run into a scaling problem: debugging a multi-agent workflow is a nightmare.

Previously, if you had a "Greeting Agent" that routed to a "Billing Agent" or a "Tech Support Agent," conversation analysis was applied to the entire call transcript. If an evaluation failed, pinpointing exactly which sub-agent dropped the ball required tedious manual review.

With Scoped Conversation Analysis, developers can now apply evaluation criteria and data collection items to either the full conversation or a specific agent node.
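As a sketch of how you might call the analysis endpoint with a scope attached: the endpoint path (`POST /v1/convai/conversations/{id}/analysis/run`) comes from this update, but the request-body fields (`scope`, `agent_node_id`) and the node ID below are assumptions for illustration, so check the API reference for the real schema.

```typescript
// Assumed request shape: run analysis over the whole call, or over one
// agent node. Field names here are illustrative, not the documented schema.
type AnalysisScope =
  | { scope: "full_conversation" }
  | { scope: "agent_node"; agent_node_id: string };

// Build the URL and JSON body for the (documented) analysis endpoint.
function buildAnalysisRequest(conversationId: string, scope: AnalysisScope) {
  return {
    url: `https://api.elevenlabs.io/v1/convai/conversations/${conversationId}/analysis/run`,
    body: JSON.stringify(scope),
  };
}

// e.g. evaluate only the Billing Agent's turns:
//   const { url, body } = buildAnalysisRequest("conv_123", {
//     scope: "agent_node",
//     agent_node_id: "billing_agent",
//   });
//   await fetch(url, {
//     method: "POST",
//     headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
//     body,
//   });
```

Scoping the request to one node means a failed evaluation points directly at the sub-agent that misbehaved, instead of at the call as a whole.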

Technical Implementation

For developers, here is how the new tooling shapes up under the hood:

  • Analysis API: POST /v1/convai/conversations/{id}/analysis/run
  • New Schema: ScopedAnalysisResult (an array of per-agent evaluation breakdowns)
  • JS SDK Hook: useConversationControls().sendMultimodalMessage
  • Input Type: MultimodalMessageInput (exported from @elevenlabs/client)
  • Workflow Config: PromptAgentAPIModelOverrideConfig now includes tool_ids and knowledge_base

By using the new tool_ids and knowledge_base overrides, you can ensure your Billing Agent only has access to billing APIs, while your Tech Support Agent only searches your technical documentation. This sandboxing approach is one of the most effective ways to reduce hallucinations in production.
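A rough sketch of what such an override might look like follows. Only the field names `tool_ids` and `knowledge_base` (on `PromptAgentAPIModelOverrideConfig`) come from this update; the surrounding object shape and every ID below are hypothetical placeholders.

```typescript
// Illustrative per-agent override: restrict the Billing Agent to billing
// tools and billing docs only. IDs are made up for the example.
const billingAgentOverride = {
  prompt: {
    tool_ids: ["tool_billing_lookup", "tool_refund_api"], // hypothetical tool IDs
    knowledge_base: [
      { type: "file", id: "kb_billing_policies" }, // hypothetical document ID
    ],
  },
};
```

The design intent is subtractive: anything not listed is simply unreachable from that node, so the agent cannot wander into the wrong toolset even under an adversarial prompt.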

Mac and iOS Developers Get a Massive Boost

ElevenLabs has clearly prioritized the Apple ecosystem in this rollout. If you are building voice apps for Mac or iOS, the new Swift SDK v3.1.2 brings several quality-of-life improvements.

The SDK now utilizes LiveKit WebRTC for ultra-low latency audio streaming, ensuring that conversational prosody feels natural and uninterrupted. Furthermore, it features deep SwiftUI Integration. The SDK is fully reactive, meaning your iOS app's UI will automatically update its transcripts and visual states as the AI speaks, requiring zero manual state management from the developer.

ElevenLabs also added environment-specific agent connections, making it much easier for iOS developers to toggle between "Development" and "Production" versions of their agents while testing on TestFlight.

The Competitive Landscape: ElevenLabs vs. OpenAI

Industry analysts are already dubbing ElevenLabs the "audio layer" of the internet, but they face stiff competition. OpenAI's Realtime API offers a highly capable "single-brain" multimodal experience.

However, where ElevenLabs continues to win is in production-ready prosody. While pure LLM-voice models might have a slight edge in raw latency, ElevenLabs' underlying models (like the newly available Eleven v3 and Scribe v2) offer unmatched voice quality, emotional nuance, and character consistency. With the addition of "Versioning" for A/B testing live traffic and structured "Agent Test Folders" for automated testing, ElevenLabs is clearly targeting serious, enterprise-grade developers who need granular control over their voice outputs.

The Privacy Angle: Cloud vs. Local

While the ability to send images, PDFs, and real-time voice data to a cloud-based agent is incredibly powerful, it also introduces significant privacy and cost concerns. Every multimodal message sent to a cloud API consumes tokens, and transmitting sensitive documents (like invoices or personal IDs) to third-party servers is often a non-starter for healthcare, finance, and privacy-conscious users.

If you love the power of voice AI but need to keep your data strictly on your own hardware, cloud APIs aren't the only way forward.


About FreeVoice Reader

FreeVoice Reader is a privacy-first voice AI suite that runs 100% locally on your device:

  • Mac App - Lightning-fast dictation, natural TTS, voice cloning, meeting transcription
  • iOS App - Custom keyboard for voice typing in any app
  • Android App - Floating voice overlay with custom commands
  • Web App - 900+ premium TTS voices in your browser

One-time purchase. No subscriptions. Your voice never leaves your device.

Try FreeVoice Reader →

Transparency Notice: This article was written by AI, reviewed by humans. We fact-check all content for accuracy and ensure it provides genuine value to our readers.

Try Free Voice Reader for Mac

Experience lightning-fast, on-device speech technology with our Mac app. 100% private, no ongoing costs.

  • Fast Dictation - Type with your voice
  • Read Aloud - Listen to any text
  • Agent Mode - AI-powered processing
  • 100% Local - Private, no subscription
