Can you hear me now?
The question, if it’s presented with early 21st century pop culture in mind, is rhetorical. If you are relying on text to speech (TTS) applications in Linux, then it’s not…and the answer is no.
No, you cannot hear me now.
I chose to have my larynx removed during the second week of January in order to ensure the cancer in my throat would not return. A recurrence is now highly unlikely, but in making that decision I faced the fact that I would no longer be able to communicate spontaneously with others. I would need to rely upon one device or another, but more often than not, a combination of those devices.
I will not use an electrolarynx or any derivative thereof.
At the age of eleven I was frightened terribly by someone using one of those devices to talk to my uncle while I stood beside him. It frightened me so much that I ran outside and sat in the car, refusing to go back inside. And sure, most kids won’t have my reaction, but I’m going to ensure that no one has that reaction…at least when I speak to them.
But there are other options. The one I lean strongly toward is text to speech. There are many online tools to accomplish this, and you can see a good example on the From Text to Speech website. Just type in what you want said, choose the voice you want to represent you and click the play icon. You can even save what you say as an MP3 file. The voices provided by this website are good, but if you compare them to the default voices in most Linux applications, the difference is glaring.
Let’s talk about those Linux apps.
When I first decided that I wanted to use text to speech on a daily basis, I began researching and testing the available applications. The Mint/Ubuntu repositories showed much promise. The first thing I did was become acquainted with the KDE app Jovie. Its appeal was that it’s built to work right in KDE, but right out of the gate I ran into such a high level of complexity and such gaping holes in usability that I just shut it down and began searching for other solutions. Apparently, Jovie depends on other voice “synthesisers” to get working.
Really? Why isn’t it an all-in-one package? Surely there has to be a better tool for the job. I’m not incompetent when it comes to getting stuff on a computer to work. I’ve used Linux long enough to know that complexity will be necessary from time to time. But what I don’t have enough of is time. So I moved on to see what else was out there.
Ah, I see we have a dynamic duo of sorts to work with here. First you have the front end, the program you actually type into, such as Gespeaker. Then you have your speech engines, like Mbrola and eSpeak, to draw the synthesis from. OK, so it’s gonna be like this for all the Linux TTS offerings, huh? Well crap, if that’s what I have to do, let’s see which is easiest.
I chose Gespeaker because, if for nothing else, it seemed the least complicated. In retrospect, that’s too funny.
I get into Gespeaker and it seems fairly intuitive. It wants me to pick a voice from Mbrola in order to work. Okay…the GUI is telling me that Mbrola is installed. It even guides me to .deb files from which to install a choice of several voices. So I am going to choose the US 1 male voice. Uh, it should be playing the stuff I type in that US 1 male voice, but it’s not. It’s still speaking like Mr. Roboto. Well poop, let me see what I can find. The voices seem to be in the correct directories.
But they are not. In fact, in researching it I discover there has been a running gun battle about these voices and the proper file paths from as far back as 2010…maybe 2008, if I remember correctly. Really? So who’s the moving target? Who is changing these things without proper documentation? Well, first off, let’s see what the developers of Mbrola have to say as to what it does and how it does it, in Synaptic:
“Multilingual software speech synthesizer
Mbrola is Thierry Dutoit’s phonemizer for multilingual speech synthesis. The various diphone databases are distributed on separate packages, but they must be used with and only with Mbrola because of license matters. Read the copyright for details.
Mbrola itself doesn’t provide full TTS. It is a speech synthesizer based on the concatenation of diphones. It takes a list of phonemes as input, together with prosodic information (duration of phonemes and a piecewise linear description of pitch), and produces speech samples on 16 bits (linear), at the sampling frequency of the diphone database.
Use Mbrola along with Freephone, cicero or espeak to have a complete text-to-speech in English.”
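For what it’s worth, the pairing that description alludes to can be exercised from the command line. Here’s a minimal sketch for a Debian/Ubuntu system; the package name mbrola-us1 and the mb-us1 voice identifier are assumptions based on the Ubuntu repositories, and the last line only echoes the command rather than speaking it, since actually running it requires the packages and a sound device.

```shell
# Sketch: pairing the eSpeak front end with an Mbrola diphone voice.
# Assumed package/voice names (from the Ubuntu repositories): mbrola-us1, mb-us1.

# 1. Install the engine, the Mbrola synthesizer and one US English voice:
#      sudo apt-get install espeak mbrola mbrola-us1

# 2. eSpeak addresses Mbrola voices with an "mb-" prefix:
VOICE="mb-us1"
TEXT="Can you hear me now?"

# 3. The actual invocation would be:
#      espeak -v "$VOICE" "$TEXT"

# Dry run: print the command that would be executed.
echo espeak -v "$VOICE" "$TEXT"
```

If that command speaks in something other than Mr. Roboto’s voice, the engine and the voice database have found each other.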
Oh yeah…I was looking into getting my speech synthesized based on the concatenation of my diphones anyway. This is why I, and many other Linux advocates, have suggested that developers leave the PR and product documentation to others.
But seriously, the state of text to speech in the Linuxsphere is a mess.
It took five days for me to finally admit defeat, throw my hands up and then wash them of this task. The trouble is, the software I was trying to get to work is seemingly the best that Linux has to offer…and that’s a pity. Maybe this kind of software is better suited for the Android and iPad platforms, where there are some remarkable TTS applications for those who are aware enough to search for them.
It’s funny. I will be attending the LibrePlanet 2015 event this year and I will be talking about how important free and open source software is to Reglue. Without it, we would not exist. I find it mildly ironic that I might be making that presentation from an iPad.
My G+ buddy Charlie Kravetz spoke at SCALE just days ago and made serious mention of these problems. There is a collaboration site which I moderate that deals with nothing but trying to figure out the best ways to bring easy to use and inexpensive AAC applications to those who need them. In fact, if the stars align just right, we’ll see you at OSCON to discuss just this topic. Because at this time, those applications do not exist for Linux, at least not those that take less than five days to mess with and still not get working.
Many of you will perceive this as an attack on the various developers of TTS software. Quite the opposite. It’s an appeal to get others interested in picking up where they have left off. It’s apparent that the developers of these various software applications have scratched their itch and walked away. They wrote the software to meet their needs and freely gave their efforts to anyone to alter in any way they see fit. For that, I offer my personal and sincere thanks. You’ve paved the road so as to make the world an easier place to live.
But what of those to come? What of those who cannot speak, or hear…or see. Those who couldn’t tell you the difference between the front end and the back end of any software if it hit them in the ass with a boat paddle?
There are a lot of us out here. There are a lot of us who will finally realize that in many cases, Linux is not the be all and end all to computing. Of course, many of us already know that. I guess that’s what maturity brings.
Ken Starks is the founder of the Helios Project and Reglue, which for 20 years provided refurbished older computers running Linux to disadvantaged school kids, as well as providing digital help for senior citizens, in the Austin, Texas area. He was a columnist for FOSS Force from 2013-2016, and remains part of our family. Follow him on Twitter: @Reglue