An In-Depth Look at Text-to-Speech in Linux

By Ken Starks on April 7, 2015

It’s been an interesting two weeks, talking about and looking into why text-to-speech (TTS) is such a mess in Linux. I’ve spoken with seventeen of you; seventeen who know a bit about software programming. “A bit” is a purposeful understatement. Some of you have forgotten more about software programming than I will ever know. That being the case, I have learned a bit about why TTS in Linux is next to worthless. For those who are just joining the conversation, let me catch you up quickly.

Late last year, I was told that the area treated for throat cancer in 2012 was exhibiting pre-cancerous activity. I was told that it could remain “pre-cancerous” for twenty years, or it could again form into the cancer that tried to kill me in 2012. If that happened and it remained unattended, it would kill me in a matter of months. My options ranged from doing nothing and taking my chances, all the way to having my larynx removed to be done with this throat cancer monster once and for all. I picked door number two.

I began researching my options as a soon-to-be voiceless person. In preparing for a life without a voice, there were several scenarios that I had failed to consider:

What if I was caught in a dire situation and my only option for survival was to call out for help?
What if I was witnessing someone in a dire situation and I could not call out to warn them?
What if I needed to travel and could not find or ask directions?
I cannot use the telephone in an emergency.
In that most verbal interaction is done on the fly, I will miss out on a lot.

As a side note of interest, I now carry a small pocket-sized but extremely loud air horn wherever I go. Silly? Not if I ever need to use it.

I’ve already experienced the third example on my list and it worked out okay. I was able to stop people and ask them questions via my sloppy writing on a notepad. But the results of the other examples could range from being merely inconvenient to deadly. I’m hoping for mere inconvenience.

It’s the day-to-day, moment-to-moment things that I will primarily miss out on, be that badinage or booby trap, and it’s already proving to be well on the wrong side of convenient. This is why I took up the banner of speech for the speechless. Since I use Linux full time, my focus turned immediately to computer-centered solutions to this problem. Go figure.

I’ll paraphrase my previous article on this matter: TTS in Linux sucks.

Now before you gather into torch and pitchfork hordes, let me walk that back a few steps. The fact that TTS in Linux sucks is not necessarily a reflection upon the developers. This specific problem can be found within the tools the average open source developer has at his disposal. Or the lack of tools in this case. Let’s begin examining TTS in the Linuxsphere and hear the difference.

eSpeak or no eSpeak…that is the question.

In the beginning of trying to find a new voice, I researched the options in Linux and was excited by the number of TTS solutions Linux offers. But one by one they were ticked off my list for the same reason: They suck.

Let’s talk about the first option I tried:

eSpeak is an application found in many distros’ repositories and on sites such as GitHub. Since it’s easy to install, I anxiously clicked on it and watched as the application popped into existence on my desktop. What a nice GUI, consisting of a simple text field with the few needed configuration commands found under an “edit” field at the top of the screen. Brevity on the face is nice.

But the brevity of the simple GUI lost its charm when I actually tried using the app. This is what I heard when I clicked “play” within eSpeak. Listen for yourself. Really…This is what we have? And again, let me say…this isn’t the developer’s fault. He or she did the best that could be done with the tools available for use.

It does get a little better, because there were better voice options for the developer. My Google+ buddy Neil Munro took on the nasty task of getting better voices working within eSpeak using Mbrola. Even though he succeeded where I failed after days of futzing with it, even he said that the process of getting voices installed from Mbrola into eSpeak was absolutely ridiculous.

First off, as a computer user, you should not have to install a separate application in order to get the first application to work. Dependencies are one thing. Entire other applications are far and away another. My unanswered question, as a layman mind you, is why can’t the voices be coded into the application instead of users having to install it separately? However, doing so did improve the voice…not much mind you, but with a striking difference. Here is how it sounded with a Mbrola voice installed.

The difference is striking to say the least. But even with that much improvement, my personal opinion is that it isn’t ready for prime time. That’s said after spending a complete afternoon using the application on my Nexus 7. It does not handle punctuation well at all and it just skips some words that it cannot pronounce properly.
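For readers who would rather compare the two voices from a terminal than through the GUI, here is a minimal sketch. It assumes the `espeak` command is on the PATH and, for the second voice, that Mbrola and its `en1` voice data are already installed so that eSpeak exposes it as `mb-en1`; that voice name is an assumption and will differ depending on which Mbrola voices you have. The helper function `build_espeak_command` is hypothetical and only assembles the command line.

```python
import shlex
import shutil
import subprocess


def build_espeak_command(text, voice="en", speed_wpm=160, wav_path=None):
    """Return the argument list for one eSpeak invocation.

    -v selects the voice, -s sets the speed in words per minute,
    and -w writes a WAV file instead of playing the audio.
    """
    command = ["espeak", "-v", voice, "-s", str(speed_wpm)]
    if wav_path is not None:
        command += ["-w", wav_path]
    command.append(text)
    return command


if __name__ == "__main__":
    sample = "Hello. This is a test, with punctuation!"
    # "en" is the built-in voice; "mb-en1" is assumed to be an installed Mbrola voice.
    for voice in ("en", "mb-en1"):
        argv = build_espeak_command(sample, voice=voice)
        print(shlex.join(argv))
        if shutil.which("espeak"):
            # check=False: a missing Mbrola voice should not abort the comparison.
            subprocess.run(argv, check=False)
```

Running the same sentence through both voices back to back is a quick way to hear how each one copes with punctuation.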

It wasn’t until almost a week into this TTS discovery voyage that I found Mary. Not a human named Mary; a text-to-speech engine named Mary. Mary is an open source application with the downside, for many, of being a Java app. That may be a downside for you, but for me, with my options being limited day by disappointing day, Mary just might be the girl of my dreams. Listen to what Mary has to say, with my thanks to the Moody Blues’ “Days of Future Passed.” What an amazing difference. Amazing.

How this application is executed, I’m not sure. Is it to be baked into a website? That’s what I’m gathering. Is there any way this app can become part of installed and usable software in Linux? I would appreciate a professional answering this for me: Is she only available as a web application or is there any hope for the speechless to use her on their personal computers? I am extremely excited about hearing what some of my Java programming buddies have to say.

In my travels I also found a Windows only app called CoolSpeech. I may give this a run in Crossover Office or Wine sometime in the coming week. From just a few minutes playing with it, I’m not particularly impressed when I compare it to Mary TTS or my personal online subscription tool, SpokenText.net.

SpokenText is by far the best I have examined to date. My Reglue presentation at MIT for LibrePlanet was done via a pre-recorded Ogg file from SpokenText as I stood at the podium and “talked” about what we do at Reglue. I was on edge the whole time. In my mind, this seemed like a ridiculous idea. However, the response to my presentation was heartening. I have submitted my white paper for Texas Linux Fest in San Marcos this summer; we’ll see how that’s received. Here is a sample of SpokenText.

I ended up paying $90.00 for an annual subscription to use the site to record my text-to-speech. I find it a bit ironic that the absolute best TTS to be had is the one easiest to use…that is if you pay for it. To me, 90 bucks was well worth the price of admission, especially as I was under the time gun to get my presentation ready for LibrePlanet 2015. Those who might want to listen to my LibrePlanet presentation for Reglue can do so here. Turn your volume down a bit though; there are some annoying clips during the first third of the file.

So after two weeks of working to find a suitable Linux TTS application, I am of mixed emotion. I am excited that the technology exists, as evidenced by a couple of the apps mentioned above. There is also a huge jump in synthetic speech generally, as demonstrated by the TV game show “Jeopardy,” where the world was introduced to “Watson.” Unfortunately, as wonderful as that synthetic voice sounds, it is guarded 360 degrees and 24/7 by a multitude of patents and exorbitant cost. IBM could throw us a bone by releasing that technology as FOSS, but I ain’t holding my breath.

Therein lies our problem. Our developers cannot use the majority of these voices because they are available only under extremely expensive licensing. And what normal Linux developer or software programmer can afford that?

All in all, Google may again upend the entire thing by their offering to allow Android apps to run in Chrome on most every device. My personal text-to-speech tool is called simply Speech Assistant, and it’s an Android app. It’s the best I have found so far and it works the way I need a text-to-speech tool to work. I use it on my Nexus 7. Hopefully the global community of tens of thousands of open source developers can find a way to make it work on Linux.

Ken Starks

Ken Starks is the founder of the Helios Project and Reglue, which for 20 years provided refurbished older computers running Linux to disadvantaged school kids, as well as providing digital help for senior citizens, in the Austin, Texas area. He was a columnist for FOSS Force from 2013-2016, and remains part of our family. Follow him on Twitter: @Reglue

linuxlock.blogspot.com/

Published in Operating Systems and Software

20 Comments

Mike April 7, 2015

I’m not surprised Mbrola is difficult to integrate…it isn’t FOSS.

Anyway, what you said caught my interest: “Our developers cannot use the majority of these voices because they are available only under extremely expensive licensing.”

I’m reminded of the restricted world of fonts on Windows. There are some freely available ones, but many are locked away under ridiculous licensing. That and font creation programs were rare/expensive beasts indeed until FontForge came along.

Perhaps what is needed is something like that for the Linux world, where anyone can create a “voice” using a FOSS tool. All we need is a well-documented format.
Mike April 7, 2015

> “My unanswered question, as a layman mind you, is why can’t the voices be coded into the application instead of users having to install it separately?”

Of course that can be done, but then you can’t change it without recompiling the application, so you trade flexibility for simplicity. It’s fine if you like the included voice, but what if you don’t?

That said, there is usually an acceptable middle ground where software comes bundled with some content, but provides an easy way to add new content. Think icons, wallpapers, cursors, and of course, fonts.
Duncan April 7, 2015

Addressing the same point as Mike…

> “My unanswered question, as a layman mind you, is why can’t the voices be coded into the application instead of users having to install it separately?”

My take on this is that different voices are in effect “themes”, just like different icon themes, color themes, window decorations, etc. You get one or a few limited choices built-in, but the whole world of available choices doesn’t open up to you unless/until you go looking for available theme-packs, which naturally come as separate packages.

It just so happens that the built-in espeak voice is very “robotic” sounding, I’d guess because, for many people, it’ll be a novelty or an effect, and the extreme robotic sound is a plus. That’s not the case for you and others who really /need/ TTS because they don’t /have/ a natural voice to use, but the fact is, that’s likely a minority of the userbase, with more users simply installing it as an optional novelty.

Meanwhile, if as others have suggested, mbrola isn’t freedomware, then it /couldn’t/ ship prepackaged with the freedomware, due in part to freedomware license restrictions intended to /keep/ it freedomware. Which is where (other?) Mike’s other suggestion comes in. If there were an easy standardized way for people to create their own voice themes, as there now is for fonts in the form of fontforge, there’d likely be more freedomware voice themes available, as there now are for fonts.
Mike April 7, 2015

One additional complication to producing voices like content is that different TTS applications may use different voice formats to work with their unique speech engines. This is just speculation on my part as I have no specific experience with TTS software, just software development in general. If this is the case, then standardizing on a single format may be difficult or impossible without promoting a single speech engine above others. We all know what happens when you try to reconcile multiple standards: https://xkcd.com/927/
Davide Repetto April 8, 2015

I noticed you don’t mention loquendo in between your experiments.
(http://www.nuance.com/support/loquendo/index.htm)
I find that it is one of the highest quality tts out there and it would be useful especially for presentations, where quality is more important.
hayder April 8, 2015

I recommend giving Cereproc and IVONA TTS a try. Although they are commercial, the voices are very high quality and usable “offline”.
Alan Bell April 8, 2015

espeak is indeed very robotic sounding, it is also very small. This means it could be included in various Linux distributions that were aiming at the size of a CD. Another property is that it is very configurable, and can speak *very* fast. There are blind people who listen exceptionally quickly and like espeak because it can go fast without the pitch being really high and squeaky.
OpenMary is great: it uses hidden Markov model voices and is dramatically more natural sounding, but it won’t go as fast if speed listening is what you are after. You can use it with speech dispatcher on Linux. Speech dispatcher is an abstraction layer that allows you to use tools like orca with a variety of speech synthesizers, and you can use it with a local or remote instance of OpenMary. And yes, OpenMary is open source and you can run it locally; it is just a bit more demanding of space and CPU than espeak, but these days that mostly doesn’t matter. http://www.theopensourcerer.com/2011/05/speak-to-me/
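As a rough illustration of the abstraction layer described above, the sketch below builds a call to `spd-say`, the command-line client that ships with Speech Dispatcher. The helper `build_spd_say_command` is hypothetical. The output module names (for example `espeak`) depend on how Speech Dispatcher is configured on a given machine, so treat any name used here as an assumption; `spd-say -O` lists the modules that are actually available.

```python
import shutil
import subprocess


def build_spd_say_command(text, module=None, rate=0):
    """Return the argument list for one spd-say invocation.

    rate follows spd-say's own scale of -100 (slowest) to 100 (fastest).
    module picks the synthesizer; None leaves the configured default in place.
    """
    if not -100 <= rate <= 100:
        raise ValueError("rate must be between -100 and 100")
    command = ["spd-say", "--wait", "--rate", str(rate)]
    if module:
        command += ["--output-module", module]
    command.append(text)
    return command


if __name__ == "__main__":
    # "espeak" is an assumed module name; check `spd-say -O` on your system.
    argv = build_spd_say_command("Speed listening test.", module="espeak", rate=80)
    print(" ".join(argv))
    if shutil.which("spd-say"):
        subprocess.run(argv, check=False)
```

Because the calling program only talks to Speech Dispatcher, swapping a fast robotic voice for a slower natural one is a matter of changing the module argument rather than rewriting the program.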
Hello April 8, 2015

“TTS in Linux sucks”: this statement is hugely misleading.

This is nothing to do with Linux.

You should instead say it is challenging to get TTS working right. Did you expect that voice synthesis is something open-source developers can easily get working? There is a whole industry behind voice synthesis. The only good solution that you found is one that you have to pay for. That should tell you something.

Please don’t put blame on Linux for something that it doesn’t deserve.
Jeff Sadowski April 8, 2015

There is a chrome plugin SpeakIt! that I think sounds far better than all the ones you posted. What sucks is I don’t see a way to use it outside of chrome. But I have to agree text to speech just doesn’t cut the mustard.
kb0hae April 8, 2015

It seems to me that in the last few days I saw (but didn’t read) something about running android apps in chrome…
Just Saying April 9, 2015

I realize this is a FOSS website and Apple is teh evil, but the fact is OS X has the best out of the box TTS setup. And its voices are the most realistic. I’m not saying you should use a Mac (I know, I know, real UNIX is scary and Steve Jobs was a bad man), I’m saying that that should be your goal. Not some poor-quality Windows software. And not even some of the web options. If you’re going to reverse-engineer, target the best.

As another commenter said, there is a whole industry around this stuff. There are solutions, but most are very expensive. This is a complex problem that extends far beyond FOSS. I work in the accessibility community. One of the reasons we have people buy Macs (and no, I’m not saying you should do that) is because it is the only platform out of the box that works well.

Rather than talking to FOSS people who don’t understand the challenges, maybe start with the accessibility and deaf communities.
Neil April 9, 2015

You may want to look at android solutions, the Google TTS engine sounds great on my Samsung S3, apparently several generations more advanced than the best that you have found under Linux.

This may be a dumb question, but have you considered one of those throat vibrator things where you just rest it on your neck and mouth the words?
I realise it’s an old-school approach but in use it’s immediate, much quicker than typing, and also doesn’t have the downsides of needing to always carry a computer of some sort.
Jeff Sadowski April 9, 2015

At the very least you should record your voice as you say each syllable. Maybe even with different inflections on those syllables. Maybe there is a way to splice your own syllables together to make the correct spoken words.
Jeff Sadowski April 9, 2015

I meant each phonic not syllable.
Duncan April 9, 2015

@ Neil:

> [H]ave you considered one of those throat vibrator […]

Yes. That was actually covered in an earlier installment. Ken had an experience as a kid where he was scared half to death by someone using such a thing, and he has vowed never to be the one scaring another kid. Given that he’s the man behind the Reglue mission, sourcing second-hand computers, repairing them and sticking Linux on them to give away to kids who otherwise wouldn’t /have/ a computer in the home…

So it’s not an option.

@ Jeff Sadowski:

> record your voice

What voice? Ken had his entire larynx removed in order to eliminate for good the chance of his throat cancer coming back (it was precancerous again, already). He has no voice to record.

Maybe if he had thought to do that before he had the operation…
Dan Saint-Andre April 10, 2015

Years ago I worked in the telephone industry during the early days of Interactive Voice Response (IVR). [That is the technology that resulted in the “Press one or say, ‘yes’” prompts when you call a large company.] They relied on phonemes for the spoken languages. I found this page for Macintosh developers: https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SpeechSynthesisProgrammingGuide/Phonemes/Phonemes.html
It is not enough to find a vowel in the text (a,e,i,o,u) and spit out “ay”, “ee”, etc. Consider the words: father, said, apple. Each has a different sound for the letter “a”. Factor in regional differences — Boston vs. Savannah — and it gets more interesting. Now you might have some idea why good implementations have a price tag.
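The point about the letter “a” can be made concrete with a toy sketch. It compares a naive letter-by-letter guess with a lookup in a tiny hand-written pronunciation lexicon. The phoneme symbols are ARPAbet, the notation used by the CMU Pronouncing Dictionary, rather than the Apple phoneme set linked above, and the three-word lexicon and both helper functions are illustrative assumptions, not part of any TTS engine.

```python
# Naive rule: every vowel letter is spoken as its alphabet name ("ay", "ee", ...).
NAIVE_LETTER_SOUNDS = {"a": "EY", "e": "IY", "i": "AY", "o": "OW", "u": "UW"}

# Tiny hand-written lexicon in ARPAbet; digits on vowels mark stress.
LEXICON = {
    "father": ["F", "AA1", "DH", "ER0"],
    "said": ["S", "EH1", "D"],
    "apple": ["AE1", "P", "AH0", "L"],
}


def naive_vowels(word):
    """Guess vowel sounds one letter at a time, ignoring context."""
    return [NAIVE_LETTER_SOUNDS[ch] for ch in word.lower() if ch in NAIVE_LETTER_SOUNDS]


def lexicon_vowels(word):
    """Return the vowel phonemes recorded in the lexicon for a known word."""
    return [phoneme for phoneme in LEXICON[word.lower()] if phoneme[-1].isdigit()]


if __name__ == "__main__":
    for word in ("father", "said", "apple"):
        print(word, "naive:", naive_vowels(word), "lexicon:", lexicon_vowels(word))
```

The naive rule gives the “a” the same sound in all three words, while the lexicon records three different ones (AA, EH and AE). A real engine needs a lexicon of many thousands of entries plus rules for unknown words, and that is before regional accents enter the picture.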
Martin April 10, 2015

Although not open source, if you don’t mind running proprietary software I find Cepstral quite good. I wrote a python script to take the contents of the clipboard and pass it through Cepstral (swift). I set up a custom keyboard shortcut to run said script. I find this a workable solution. Cepstral.com if ya interested.
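The script mentioned in this comment was not posted, so the following is only a guess at what such a script might look like, not the commenter’s code. It assumes an X11 desktop with `xclip` installed for reading the clipboard and Cepstral’s `swift` command on the PATH; the function names are hypothetical.

```python
import shutil
import subprocess


def read_clipboard():
    """Return the current X11 clipboard text using xclip (assumed to be installed)."""
    result = subprocess.run(
        ["xclip", "-selection", "clipboard", "-o"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def speak_clipboard(read=read_clipboard, run=subprocess.run):
    """Pass the clipboard text to Cepstral's swift command.

    Returns the command that was run, or None if the clipboard was empty.
    The read and run parameters exist so the function can be tried without audio.
    """
    text = read().strip()
    if not text:
        return None
    command = ["swift", text]
    run(command, check=True)
    return command


if __name__ == "__main__":
    if shutil.which("xclip") and shutil.which("swift"):
        speak_clipboard()
    else:
        print("xclip and swift both need to be on the PATH")
```

Binding the script to a keyboard shortcut in the desktop’s settings gives a copy-then-speak workflow: highlight text, copy it, press the shortcut.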
Ben April 11, 2015

Don’t forget about Festival at http://www.cstr.ed.ac.uk/projects/festival/. There is also a lightweight version called Flite.
Ken Starks April 15, 2015

To Just saying…

You may be surprised by me expressing this, but when it comes to tools like TTS, partisanship and fanboi-ism need to be kicked to the ditch. For reasons stated in the article, being without a voice can be deadly in some circumstances. So I will say it in the least confusing way possible:

I don’t care if the app is written in Unicorn blood. I am going to use the tool that best fits my needs and allows me to live in a safe and productive manner. The ability to remain productive for the rest of my life transcends politics and fanboi postures. If the Linux or FOSS “solution” doesn’t solve the problem, then maybe it’s time to look elsewhere.

And as an aside, without saying too much, Spencer Hunley and Neil Munro are doing some amazing work in bringing TTS into easy usability for the masses. DeeAnn Little is the project manager for what we hope is a way to use text to speech easily and for little if any cost.

While I cannot release any specifics, just let it be known that things are beginning to get exciting. Guys like Charlie Kravetz and myself…everyday Linux users…are seeing magic right before our eyes. This is encouraging given the fact that this project is just now getting off the ground.

But to my main point. I may dislike the business practices and tactics used by “the other guys”, but if it comes down to it, I’ll use or fork any solution that gives me my life back.

And our newest member, David, is making MaryTTS usable for single users without the need of connecting to a server somewhere. We’ll keep everyone posted on that as well.
Ken Starks April 15, 2015

Neil, I would be amazingly happy if Google Voice put their male US voice back in the rotation. I don’t know why the male voice disappeared but it’s no longer in the available voices.

Our group is working toward and considering Android apps that can be deployed. And as mentioned in the comments (thank you)…Android apps are already being prepped to run in a Chrome browser. I don’t believe the OS has any bearing on that being a choice. And we’ll look at every choice available. Neil Munro is working on an absolutely fantastic top grade extension that makes text to speech in Chrome the very best in its class.

This is what happens when like-minded people get together to get stuff done. We’re tackling this from every direction available to us.

Jeff S. Thank you. And you are correct. The SpeakIt addon is just this side of miracle work. Not sure of this but I believe this addon is an extension of the stand alone app named SpokenText.net. That app is amazing as well and I’ve used it from time to time.

But some of the really promising work is coming from building MaryTTS into your computer. The actual server side part of the app can be installed on the laptop or desktop at large. When using the app, it doesn’t have to connect to another remote server, thus speeding the whole thing up nicely. I’ll be attempting this in another couple of days and I’ll report at a later date. We want to make sure all of the solution vectors are presented at once, thus eliminating the urge to jump on the one that’s working best at the time. But that’s the beauty of having an experienced Project Manager working with us.
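For anyone who wants to try the local-server approach, here is a minimal sketch of what talking to a MaryTTS server on the same machine might look like. It assumes the server has been started locally and is listening on its usual default port, 59125, and that it accepts the `/process` request parameters used by the MaryTTS 5.x HTTP interface; check both against the version you install. The function names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_marytts_url(text, host="localhost", port=59125, locale="en_US"):
    """Build the request URL asking a MaryTTS server to turn text into a WAV file."""
    query = urllib.parse.urlencode({
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    })
    return f"http://{host}:{port}/process?{query}"


def synthesize_to_file(text, wav_path, **server_options):
    """Fetch synthesized speech from the server and save it to wav_path."""
    url = build_marytts_url(text, **server_options)
    with urllib.request.urlopen(url, timeout=30) as response:
        audio = response.read()
    with open(wav_path, "wb") as handle:
        handle.write(audio)
    return wav_path


if __name__ == "__main__":
    # Only prints the URL; call synthesize_to_file() once a local server is running.
    print(build_marytts_url("Nights in white satin, never reaching the end."))
```

With the server running locally, `synthesize_to_file("Hello", "hello.wav")` would write a file that any audio player can open, and nothing leaves the machine.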