
How to Stop AI from Butchering Fantasy Names (Without Paying Monthly Fees)

Tired of your AI narrator ruining complex character names and medical jargon? Learn how to build custom, offline pronunciation dictionaries that sound perfectly human—saving you hundreds in cloud subscriptions.

FreeVoice Reader Team
#tts #offline-ai #mac

TL;DR

  • Stop using "sounds-like" guessing: The 2026 standard for custom AI pronunciation relies on IPA (International Phonetic Alphabet) and PLS (Pronunciation Lexicon Specification), bypassing clumsy phonetic spelling.
  • Local AI rivals the cloud: Edge models like Kokoro-82M and Piper deliver lightning-fast, offline TTS on standard hardware, outperforming cloud generation speeds with zero recurring costs.
  • Protect your unpublished work: Cloud TTS services process your manuscript on remote servers. Local, offline generation guarantees your IP remains strictly on your device.
  • The foolproof workflow: Extract proper nouns, verify with a local STT model like Whisper v4, and map the corrections in a simple JSON dictionary before generating your final audio.

You've spent months meticulously writing the perfect sci-fi manuscript, fantasy epic, or complex educational course. You feed your text into a text-to-speech (TTS) engine, sit back, hit play, and immediately wince. The AI confidently mispronounces your protagonist's name, turns medical jargon into word salad, and completely breaks the immersion of the listening experience.

If you've ever tried to fix this by spelling a word phonetically (e.g., changing "Xylo'thrax" to "Zye-low-thraks"), you know how frustrating the trial-and-error process can be. The AI often reads it with the wrong emphasis, unnatural pauses, or a weird robotic inflection.

Fortunately, the voice AI industry has matured. We are no longer reliant on hoping a cloud algorithm guesses correctly. In this breakdown, we'll explore how modern offline text-to-speech engines handle custom dictionaries natively, saving you both time and hefty monthly subscription fees.

The End of "Sounds-Like" Guessing (Welcome to IPA)

To build a reliable custom dictionary, you have to speak the language of the AI model. As of 2026, the industry has largely standardized on precise phonetic mappings rather than text-based guesswork.

There are three primary methods models use to ingest your custom lexicons:

  1. PLS (Pronunciation Lexicon Specification): This is a W3C standard XML format utilized heavily in enterprise-grade setups. It is incredibly robust but can be tedious to write by hand. You can review the structure in the W3C PLS Official Documentation.
  2. IPA (International Phonetic Alphabet): This is the current gold standard. Because most neural TTS models use phoneme sequences under the hood, passing an IPA string directly to the engine bypasses text normalization entirely.
  3. RegEx/Text Normalization: Great for global structural changes. For example, using regular expressions to ensure the abbreviation "Dr." translates to "Doctor" when preceding a name, but "Drive" when appearing at the end of a street address.
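The third method can be sketched in a few lines of Python. The rule set below is a minimal illustration of the "Dr." example, not a full text normalizer; `normalize_dr` and its two regexes are our own assumptions about how such a rule might look:

```python
import re

def normalize_dr(text: str) -> str:
    """Minimal 'Dr.' disambiguation sketch for TTS pre-processing."""
    # "Dr." immediately before a capitalized word is a title -> "Doctor"
    text = re.sub(r"\bDr\.\s+(?=[A-Z])", "Doctor ", text)
    # "Dr." at the end of a phrase, preceded by another word -> "Drive"
    text = re.sub(r"(?<=\w )Dr\.(?=[\s,.;]|$)", "Drive", text)
    return text

print(normalize_dr("Dr. Chen lives at 42 Maple Dr."))
```

A real deployment would need many more rules (units, numbers, currencies), but the principle is the same: normalize the written form before the engine ever sees it.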

When using modern local engines, a custom dictionary is often as simple as a local .json file that maps a written word directly to its IPA equivalent:

{
  "Xylo'thrax": "/zaɪloʊθræks/",
  "Otorhinolaryngology": "/oʊtoʊˌraɪnoʊˌlærɪnˈɡɒlədʒi/",
  "Ngata": "/ˈŋɑːtə/"
}

If you are intimidated by writing IPA, open-source tools like bootphon/phonemizer can automatically translate text into phonemes to get you 90% of the way there, allowing you to manually tweak the vowels and stress marks.
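Once the JSON file exists, injecting it is mostly string substitution performed before synthesis. The exact phoneme markup each engine expects differs, so treat `apply_lexicon` below as a hypothetical pre-processing step under those assumptions, not any engine's real API:

```python
import re

# The same mappings as the JSON example above (inlined for brevity;
# in practice you would json.load() the dictionary file).
LEXICON = {
    "Xylo'thrax": "/zaɪloʊθræks/",
    "Ngata": "/ˈŋɑːtə/",
}

def apply_lexicon(text: str, lexicon: dict) -> str:
    # Replace longer keys first so no entry clobbers a substring
    # of another entry before that entry is matched.
    for word in sorted(lexicon, key=len, reverse=True):
        text = re.sub(re.escape(word), lexicon[word], text)
    return text

print(apply_lexicon("Ngata faced Xylo'thrax.", LEXICON))
```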

Local Powerhouses vs. The Cloud Tax

For years, getting human-like emotion and accurate pronunciation meant paying exorbitant subscription fees to cloud providers. But a massive shift toward "edge computing" has flipped the market.

As a popular r/TTS thread ("Best way to handle fantasy names in AI Narration") makes clear, authors are increasingly abandoning cloud solutions due to escalating API costs and data privacy concerns. Let's look at how the platforms stack up today.

The 2026 TTS Landscape

Model      | Type          | Dictionary Method    | Best Use Case               | Cost
ElevenLabs | Cloud         | API-based / GUI      | High-end remote processing  | ~$22-$99/mo
Kokoro-82M | Local/Edge    | JSON Phoneme Mapping | High-quality mobile/desktop | Free (Open Source)
Piper      | Local/Offline | ONNX-based Lexicons  | Mass audiobook processing   | Free (Open Source)

If you look closely at the performance benchmarks, the argument for paying a cloud subscription begins to crumble. On a standard Apple Silicon M3 Max Mac or a Snapdragon Gen 5 Android device, the open-source rhasspy/piper engine generates one minute of audio in roughly 1.2 seconds.

The hexgrad/Kokoro-82M model—the current darling for mobile application integration due to its incredibly efficient 82-million parameter footprint—generates a minute of ultra-realistic audio in ~2.5 seconds directly on your phone.

By comparison, waiting for a network handshake and rendering through a cloud service like ElevenLabs takes roughly 4-6 seconds per minute of audio. You are paying a premium to wait longer, all while handing over your proprietary manuscript data to a third-party server. (You can read more about their dictionary limitations in the ElevenLabs Pronunciation Dictionary Guide).

The Ultimate Audiobook Workflow (Extract, Verify, Correct)

If you want to narrate a 100,000-word fantasy novel using a local engine like Kokoro or Piper, you need a bulletproof workflow to manage made-up locations, unique character names, and magical items.

Here is the industry-standard workflow for 2026:

  1. Extract: Use a simple Python script (or your text editor) to extract all capitalized proper nouns from your manuscript into a raw list.
  2. Verify (The AI Feedback Loop): Take a small sample of your audiobook and run it through a local Speech-to-Text (STT) model like Whisper v4. By reading the transcription, you can immediately spot where the TTS engine naturally guesses the pronunciation wrong.
  3. Correct: Build your local JSON lexicon. Define the correct IPA syntax for every difficult name.
  4. Inject: Point your desktop TTS tool—such as OpenVoiceOS or a community fork of Coqui-ai TTS—to your new dictionary file.
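Step 1 above can be sketched with the standard library alone. The mid-sentence-capital heuristic below is our own assumption; it deliberately skips sentence-initial words, which are ambiguous and need a manual pass:

```python
import re
from collections import Counter

def extract_proper_nouns(text: str) -> Counter:
    # Capitalized words preceded by a lowercase word are very likely
    # proper nouns; sentence-initial capitals are skipped as ambiguous.
    candidates = re.findall(r"(?<=[a-z,;] )[A-Z][\w']+", text)
    return Counter(candidates)

nouns = extract_proper_nouns(
    "The wizard Xylo'thrax rode to Ngata. Ngata was far."
)
print(nouns.most_common())
```

Sorting by frequency lets you prioritize: a name that appears 400 times deserves a hand-tuned IPA entry, while a one-off tavern name may not.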

Once injected, your local AI model will dynamically swap out the text for phonemes right before generation. The result is perfectly emphasized, seamless narration with zero manual audio splicing required.

Real-World Impacts: Beyond Fantasy Fiction

While fantasy authors are the most vocal about custom lexicons, the ability to control local AI pronunciation offline has massive accessibility and diversity implications.

  • Neurodiversity Support: Readers with dyslexia or visual impairments rely heavily on TTS. A custom dictionary allows users to inject phonetic pauses, slow down specific complex words, and adjust regional vernacular without waiting for a cloud server to buffer.
  • Cultural and Regional Accuracy: Standard cloud models often force an "Americanized" or "British" standard on native names. Local dictionaries allow users to force the correct pronunciation of indigenous names (like Māori terminology in New Zealand English) globally across their device.
  • Medical & Legal Jargon: Educational audiobooks are frequently bottlenecked by the AI mispronouncing drug names or Latin legal terms.

We see robust mobile implementation of this via apps like Voice Aloud Reader (Android), which features a brilliant RegEx-based local editor. In 2026, Android's Gemini Nano and Apple Intelligence's Personal Voice allow deep, system-level overrides for how your phone reads text aloud.

Why Local-First is the Only Sustainable Path

The bottom line is that the "Cloud AI" era of TTS was a stepping stone. As edge models shrink in parameter size while growing in natural prosody, paying $99 a month for a service that restricts your dictionary size and monitors your data is becoming obsolete.

By adopting offline engines and taking 20 minutes to learn basic IPA formatting, you take absolute, permanent control over how your text sounds. Your audio generates instantly, your internet bill doesn't dictate your productivity, and your unreleased manuscripts remain exactly where they belong: on your own hard drive.


About FreeVoice Reader

FreeVoice Reader is a privacy-first voice AI suite that runs 100% locally on your device. We utilize a "Hybrid Lexicon Engine" that syncs your custom dictionaries seamlessly across platforms, allowing for a point-and-correct UI where you can tap a mispronounced word and instantly update your device's PLS/JSON files.

Available on multiple platforms:

  • Mac App - Lightning-fast dictation (Parakeet V3), natural TTS (Kokoro), voice cloning, meeting transcription, agent mode - all on Apple Silicon
  • iOS App - Custom keyboard for voice typing in any app, on-device speech recognition
  • Android App - Floating voice overlay, custom commands, works over any app
  • Web App - 900+ premium TTS voices in your browser

One-time purchase. No subscriptions. No cloud. Your voice never leaves your device.

Try FreeVoice Reader →

Transparency Notice: This article was written by AI, reviewed by humans. We fact-check all content for accuracy and ensure it provides genuine value to our readers.

Try FreeVoice Reader for Mac

Experience lightning-fast, on-device speech technology with our Mac app. 100% private, no ongoing costs.

  • Fast Dictation - Type with your voice
  • Read Aloud - Listen to any text
  • Agent Mode - AI-powered processing
  • 100% Local - Private, no subscription
