Anyone who has read FOSS Force for the last couple of months knows that I lost my voice to cancer and that I’ve become personally involved in getting a decent text-to-speech (TTS) application developed. Some of you have reminded me that there is a good assortment of text-to-speech applications for Linux, especially in the mobile market, such as Android and the iExperience. Granted, for both examples, but we need an application that can either come preinstalled or be easily installed on almost any Linux distribution. That leads us back to the plentiful choices within the Linuxsphere you feel the need to mention. Yes, there are a lot of them, but when it all gets boiled down, they all share one simple trait.
None of them even approach usability for the everyday computer user. None. And you would think that of all these choices, one of them has to work…or provide documentation reasonable enough for everyone. You would think.
They are a hodge-podge assortment of half-finished, half-baked good intentions wrapped in the shiny label of “open source.” Isn’t that nice? Many of us rush “open source” to the fore, like it’s a magic phrase — a sparkling breath of glittery mist that coats the project like it was the long-sought solution to whatever problem it promises to fix. It’s nice to have that idea by which we can comfort ourselves. It’s a warm blanket of security, knowing that all of those tools are available to anyone, with all forms of “freedom” intact.
Except that it’s not true. When you peel back the shimmering wrapper of open source expectations, only to find a promissory note…well, that’s known as a rude awakening. The rude awakening that I received when I took all my expectations and beliefs to the Well of Open Source, only to find it with a broken crank and a cut rope.
It didn’t start out that way. The people who wrote these programs and applications wrote them to the best of their ability. They wrote them and left a promissory note, in hope that someone would be along shortly and complete the work that they did not know how to finish. And that’s the way it’s supposed to work. That, at least in theory, was the purpose behind places like SourceForge, back when SourceForge was a great idea with almost unlimited pools of talent spilling out open source products for the world to use. That’s the golden heart of open source. But somewhere the system got broken.
Applications like Festival started out to be promising, but the developer had to rely on others or other sources to provide the actual “voices” for the application. I needed to find out how to install a decent voice in Festival. “Oh…that can’t be that hard, now can it?”
This is the first thing I found when I searched for and discovered the answer:
Setting up Festival and using better voices
——————————————————————–
Section 1: Get Festival up and running with basic voices
- Additional reading:
Here are some places I pulled notes from:
http://gentoo-wiki.com/HOWTO_speechd (speechd and festival notes – mbrola too)
http://forums.gentoo.org/viewtopic-t-195579-highlight-festival.html
http://dailypackage.fedorabook.com/index.php?/archives/42-Productive-Monday-Festival-Speech-synthesis.html
#Someone elses notes on how to build your own festival voice based on recordings of your voice:
http://www-csli.stanford.edu/semlab-hold/muri/system/festvox.html
- You definately need to use Festival 1.96 or better, the older version sound very poor:
http://festvox.org/packed/festival/latest/
#Get these two packages to start:
festival-1.96-beta.tar.gz
speech_tools-1.2.96-beta.tar.gz
Unpack these tars in the same parent directory, festival will unpack into a directory called “festival”, speech tools into “speech_tools”. Compile speech_tools first, then compile festival. Next unpack these other packages in the same parent directory (these get loaded into directory “festival”).
festlex_CMU.tar.gz
festlex_OALD.tar.gz
festlex_POSLEX.tar.gz
- Next are the voice packages:
festvox_cmu_us_awb_arctic_hts.tar.gz
etc.
- These packages help the voices to sound MUCH better:
festvox_kallpc16k.tar.gz
festvox_kedlpc8k.tar.gz
festvox_kedlpc16k.tar.gz
- Now edit this file to use the new voices:
./festival/lib/siteinit.scm
;And add this line:
(voice_kal_diphone)
;And then change the line like this to your new voice (notice the prepended “voice_” to the voice name):
(set! voice_default 'voice_nitech_us_clb_arctic_hts)
What we expected to be a cool pool of promise turned into a fetid dunk tank for the masses. Really? This is what a new Linux user will find when seeking a substitute voice? No one can honestly say that the default voices for any of these apps are ready for prime time. So this is how I get better voices into Festival? Tell me you are kidding…
If that wasn’t enough, it gets worse.
In the case of Festival and other TTS programs, the voices tend to be extremely robotic and harsh. Other programs, like Mbrola, can supply voices to improve those default voices. Each of the different developers upgraded/updated their voice apps without telling the synthesis folks that the directories for their files and voices had changed. So the few of us who stiffened our lips and dove into all those lines of code found the whole damned mess to be broken. And we found it useless. That refreshing pool of promise that represented open source was quickly discovered to be a stagnant pool where even mosquitoes wouldn’t nest.
Who is at fault? Who do we blame? This isn’t about blame; it is about solutions. Solutions to fix a broken system. Solutions that many of us have the talent to provide. But talent isn’t the only ingredient in this success. It needs people with the will to help create the solutions to these problems.
I began a quest to find people to help me build a front end for the TTS application, MaryTTS. It’s an open source Java app that is beautiful in production, but an absolute nightmare to get working on the everyday computer user’s machine.
Luckily, a good guy by the name of David walked me through the process of making it work on my computer. It’s not easy, trust me. You basically turn your computer into the server for the app. You can create a server off-site, but that creates some latency issues you might not find comfortable.
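To give a rough idea of what “turning your computer into the server” means in practice, here is a minimal sketch of a client talking to a locally running MaryTTS server. It assumes a MaryTTS 5.x server listening on its usual default port (59125) and accepting requests at `/process`; the function names `build_request_url` and `synthesize`, and the output file name, are made up for illustration and are not part of MaryTTS or of the front end being built.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a MaryTTS 5.x server is already running on this machine
# and listening on its usual default port.
MARY_URL = "http://localhost:59125/process"


def build_request_url(text, locale="en_US", base_url=MARY_URL):
    """Build the URL that asks the server to turn `text` into WAV audio."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    return base_url + "?" + urlencode(params)


def synthesize(text, out_path="speech.wav", timeout=10):
    """Fetch synthesized speech from the local server and save it to a file.

    Hypothetical helper: it raises an error if no server is running.
    """
    with urlopen(build_request_url(text), timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(audio)
    return out_path
```

With the server running, calling `synthesize("Just speak for me.")` should leave a `speech.wav` file that any audio player can open. Pointing `base_url` at an off-site machine instead of `localhost` is where the latency mentioned above would come in.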
Neil Munro worked up a Chrome extension that provided basic TTS. Although he writes his efforts off as “nothing special,” it’s a step forward. It matches the other Chrome addon efforts, but stays simple instead of inundating users with an array of hacks and cosmetic options. I honestly don’t care about the text color; it’s not important. Just speak for me. That’s all I want. I just need the app to provide me a voice.
I promised not to mention their names, but a couple of guys, and maybe a third, are working on this GUI for MaryTTS. And when it comes down to it, it’s not really that fancy. You use your browser, or the browser GUI our development friends are building, to pull the whole thing together. I don’t have to rely on an online tool or importing voices that are all but impossible to incorporate. And while most ISPs are showing off some fairly admirable “up time” statistics, it’s good to have it all under one roof. Because when an outage does occur, it’s good to know that the tool you need isn’t dependent on an internet connection.
For me, this is an important step, and a step that wouldn’t be taken if I had not lost my voice to cancer. Had I not lost my voice, I couldn’t have cared less. But here I am and here’s where it gets personal.
I spend a lot of time in small groups. We all do. I spend time in groups of peers or groups of parents and kids that are helped by Reglue. And of course, within intimate groups of my family. Anyone who has studied the social group dynamic knows how important voice inflection and timing can be, as the conversations within take place. Let me give you a case in point. Someone I have begun an email friendship with is an extremely high-functioning autistic. Actually, she is brilliant. She mentioned this dynamic in her own life and I smiled as I read how she deals with the varying voice inflections and timing and how difficult they can be to interpret as the waves of other conversations intermingle.
A few weeks ago, my oldest daughter and her family drove from Copperas Cove to spend the day. Some would remark that our living room space is, uh…cozy. That’s a polite word for “small.” Add two more people and we can move up to the term “cramped.” So everyone gets seated and the conversations begin. I have on my lap my Nexus 7 for the Android app “Speech Assistant” and my “handy-dandy-this-is-what-etch-a-sketch-has-evolved-into” Boogie Board.
It didn’t take long for me to realize that I was the proverbial salt on the bird’s tail, when it came to adding or commenting in the group. When I felt that I had something to contribute, the group would wait politely for me to write or type out what I wanted to say. That includes the times necessary to erase a mistake or redo something I had written by accident.
As Ron White would remind us, “Now that there was some awkward social presence, I’m here to tell ya.”
When I felt the need to interject something, by the time I had it written or typed in, the subject had moved in a completely new direction, and introducing the comment on my boogie board would have caused confusion, at the very least, not unlike watching a movie when the lips and the words are way out of sync. Had I not been in front of family, I would have excused myself and left the room, not to return.
I have begun experimenting with the electronic larynx for use at home and around friends. You know…the one I swore I would never, ever, use? According to my ENT surgeon, I have the toughest and the thickest neck tissue he has ever seen. Trying to find just the right pressure and placement for the head of the device can change from use to use. What bothers me most is that the electronic larynx is the gold standard, the go-to tool for communication after a laryngectomy. Unfortunately, my specific situation is not suitable for a TEP device, which is a surgically-implanted device that allows one to speak.
After a while, one begins to realize that (s)he has no real place in that specific social group. The inability to contribute in a timely manner, combined with the just plain awkwardness of the situation, can lead to one becoming a social deportee of sorts. I found myself being the “go’fer.” I was the one to refresh drinks and foodstuffs. I was the one to entertain my granddaughters when they got unruly or bored. In short, I was a stage prop for the play that was being acted out around me.
That’s an uncomfortable place to be, no matter how you try to analyze it.
But you know…in all of this I realized that the speechless are not the only ones in this situation of being a social deportee of sorts. While we are not castigated, we most certainly learn to adapt to our place in the group…and we tend to avoid social situations in which we are expected to react.
One of my best friends is a brilliant guy and he’s an absolute scream to be around. Sometimes I laugh so hard that I slobber down the front of my shirt when I’m around him. His Ph.D isn’t a captured, glass-framed moment in time for him. It’s a reminder that no matter what we accomplish, we can always accomplish more.
While he isn’t speechless, he does suffer acute hearing loss. We talked about this recently, and for him, it’s the exact same thing: A physical condition that regulates his social activity. He doesn’t find being part of a social crowd comfortable, especially if they are strangers. Often he makes sure that his wife or a friend is close, so he doesn’t ask the person speaking to repeat themselves over and over. The person with him can act as a repeater of sorts. Unfortunately, his hearing issues have no medical or hardware solutions. Hopefully, someone with an open source mindset can, one day, fix this.
And that’s what it all comes down to. Being willing to donate your time and talent to help those who have no other alternative. Whether the barrier is financial or geographic, many people don’t have the options they need to make their lives better. That’s where the real open source community can help.
I started out by offering a bounty for the MaryTTS GUI.
Everyone who has responded politely told me to stuff my money in my, uh…ear. They said they would make a solution shortly…they just need to find the time to do it. From what I understand, this project is moving along at a decent rate. Hopefully, one of the greatest problems speechless people experience can be lessened or even wiped off the table completely, at least in the Linuxsphere.
It just depends on how attuned we are to those needs. All of us. And I can understand the hesitancy to accept money. That contract can be perceived as a binding lever to control the rate, quality and gauge of your work. I understand that. But in return, I’ll demonstrate how important this is to me.
I have a bit of money put away. One day I would like to take my youngest daughter to the Chicago Museum of Science and Industry. My dad took me there when I was eleven years old and it was a mesmerizing experience, all the way down to walking through a WWII German submarine. It was pure sensory overload as I took it all in. The hair on my arms stood at rapt attention as the history of transportation display took my breath.
My daughter, who helps with Reglue stuff at times, told me that this need for TTS software was way more important than a trip to Chicago. She made me promise to use the money for this project. She even tried to donate a $100 bill to the effort, an offer I of course refused. I took the 100 dollar bill from her and put it in her shirt pocket, along with a kiss to her forehead. A $100 donation via PayPal showed up later in the day. I’m not at all sure the name of the donor was correct.
Yeah…that’s my girl.
So this isn’t about just me…or you, for that matter. This is about a young lady who works two jobs, selflessly offering the only value she has at her disposal. This is about a fourteen-year-old boy who had his tongue splayed as a warning to keep his mouth shut after witnessing a murder in the ghettos of Anfield in Liverpool. This is about people with problems we can help. People who have no idea that a loose-knit community of technologists might be able to give them at least a part of their life back.
This isn’t about your money. It’s about your time and talent and how you can spend that to help those who cannot help themselves. And if you need money for your efforts, I don’t begrudge you a dime…just keep in mind that we don’t have a lot of it.
The only real problem is in finding out where to go to help with a particular program or effort that matches our talents and skills. I’m a good place to start. There are four brilliant people now working on providing an easy-to-use GUI for the MaryTTS app, or other apps for that matter. That is indeed a start. Let’s talk about your skills and ways you would like to help.
Ken Starks is the founder of the Helios Project and Reglue, which for 20 years provided refurbished older computers running Linux to disadvantaged school kids, as well as providing digital help for senior citizens, in the Austin, Texas area. He was a columnist for FOSS Force from 2013-2016, and remains part of our family. Follow him on Twitter: @Reglue
The beginning of your article makes it sound like the lack of user friendly TTS software is somehow a failure of open source. It certainly is not. Just because something isn’t finished or doesn’t meet expectations doesn’t make it a failure. It just means it isn’t done…yet.
Open source isn’t a panacea, just a tool – one of the most effective tools in existence. The promise and success of open source is EXACTLY why you can have developers come in and build on existing works to meet your needs. Without open source, everyone would have to reinvent the wheel every time an existing piece of software didn’t quite fit or was abandoned by its creator.
I’ve mentioned this before – mbrola isn’t FOSS, so is not really relevant in discussing open source software. If open source were the issue, then surely mbrola would be the solution you are looking for, right?
“The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.” – George Bernard Shaw
Please don’t adapt to your situation. Please don’t accept things as they are.
Others wanted to fly, to navigate underwater, to jump from outer space without a plane (!). We will have some way of making it possible for you (and many others) to interact and feel part of the conversation. Heck, they’re managing to get responses from people in a vegetative state!
The other day I mentioned the need to address aspects other than speech. This video might give an idea (it’s about sign language to speech translation):
https://www.youtube.com/watch?v=ouI1ZlhzAYw
I know you’re worried about F/OSS and I completely understand that things on Android are not good enough. We probably want to have things open-source so that development can go faster.
I don’t know if the shown tech in the video above is open-source — probably not.
If you want to see the very raw beginnings of something more open, there’s this video (from India):
https://www.youtube.com/watch?v=wbyoTp0O2eM
Fascinating, but my Hindi is enough only for recognizing “Namaste”.
Of course, in an ideal world, we’d have lip reading by Linux tablets…
Now, excuse me if I’m totally ignorant of this subject, but why don’t people use the electronic larynx and a PC to recognize words and then speak them with an artificial voice? That would be faster, me thinks.
“…why don’t people use the electronic larynx and a PC to recognize words and then speak them with an artificial voice? That would be faster, me thinks.”
We’ve made stunning advances in almost every technology in use by man. We have robots and orbiters doing our bidding and prepping for our on-site arrival on Mars. We’ve uncovered stuff even Sci-fi hasn’t dreamed of in the Mariana Trench. The unfortunate but dazzling leaps (pun not intended) in prosthetic technology are right out of Star Wars. We’ve beaten back cancer to at least give the patient time for the Next Big Thing in cancer treatment. We carry a full-blown computer in our pockets and purses so we can tweet and twitch our way into a stupor, and more than likely the most disruptive technology of all time is driving us to work while we read the paper or shave in the back seat.
And with all that, there are two technologies that haven’t budged much since their inception…Real Player and the Electro-larynx. Not only is the electric larynx capable of frightening a child back into the fetal position, it is, at best, barely understandable for many people. My neck tissue is so dense that all the different models I have tried are barely able to make me understandable. I agree, this is something that should have been improved in the last 50 years, but sadly, it has not been.
And I was wondering when you were peeking over my shoulder. I actually tried what you suggest. The interpretation by the PC/software is hilarious. I’ll have to post it when I get enough of it ready for publication.
The front end for MaryTTS is coming along nicely and with my much improved swype/typing skills, I can get 54 wpm. I’m hoping to hit 80 by the end of the month. Now I have to learn to incorporate the offered suggested words to make it even faster. Once I get in the 80 wpm category, most conversations with me should be almost bearable.
I wish I could code.
Good luck to every person who will benevolently work on this project.
Thank you for writing this article. I am in the process of teaching myself Raspberry Pi and Arduino so that I can build an affordable talking machine for my son. I am also enrolled in an upcoming voice acting class so that I can learn to record nice voices for these machines. Currently I am not in any group work, but I would love to be part of a collaboration with others who want to make this dream a reality.
Muriel, thank you.
And you spotted it outright where others still struggle with what I am trying to tell them. In Linux/FOSS, the software made to interpret the voice and the voices themselves seem to both be bridges built to span a river. The problem is, the bridges fall 30 feet short of meeting one another. It’s not that this is a new or even relatively new problem. It’s been this way for years now.
As long as the voices and the TTS software remain two separate efforts, problems like I describe will remain unfixed. As long as developers improve their code but fail to inform other developers that important file paths have changed, problems like I describe will remain unfixed.
In the Universe of FOSS, these are fairly easy problems to fix. All we need to do to fix the majority of them is to talk with one another.
I’ve been called out for mentioning Mbrola…as I have seemingly mixed it into FOSS when it is not. Good eye, and thank you for bringing that up. My bad.
But my point is, we need the voices and the software to be one and the same, not something integrated by shoehorn as an afterthought. We have people working on a front end for MaryTTS as their time allows. Hopefully by this time next month, we will have a working model, or beta software that completes the interface. While money is always an important thing, I think we often overlook the developer’s efforts as something offhand and not terribly difficult or demanding. Hopefully this effort will bring the developers, and the efforts they make, into the spotlight, where they belong.