Imagine a computer that not only understands what you say but also talks back in a way you can understand. Imagine a computer that you can talk to the way you talk to humans, and that responds the way humans respond. Imagine a computer that can read your thoughts and communicate with you as seamlessly as you’ve always wanted it to, ever since you saw 2001. OK, keep on dreaming. 2001 is in the past, yet still in the future.
But indeed, a HAL of a lot has changed since 1968. Now we can build machines that reliably understand spoken commands, whole phrases or sentences, and react accordingly: provide timetables, transfer money, book tickets, or assist with any kind of problem we might have in today’s life 2.0 (man, how I hate this 2.0 thing by now!). And those machines are increasingly built to use human patterns of communication, allowing for more or less free speech, interactive turn-taking, and relatively natural-sounding computer voices.
Over nine months ago – man, time flies like an arrow (hey, we can even build machines that understand the ambiguity of that sentence; apparently we already could way back around 1968) – I wrote my first article on Natural Dialog Management (also check out the 11/05/2008 jam session on this topic). I promised I’d continue on this, so here I am. Today I want to write about how you can make a voice application adapt to the caller’s “speaking style”: the vocabulary they use, “how they speak”. Why should you do this? Think of a doctor trying to explain what’s wrong with you. If he or she doesn’t adapt his or her vocabulary to yours, you might just as well stay home and google the symptoms yourself.
Here are some examples where Adapt-to-me (as we like to call it at Voxeo) makes sense in speech applications:
- If you are a provider of, say, landline telephony as well as high-speed internet, you might have callers calling into your helpline saying “I have problems with my Internet connection” at your first How-Can-I-Help-You input state. Your system might confirm this by saying “I understand you have problems with your DSL, correct?”. The technology you use to provide internet access might be DSL – but does your customer necessarily know that? How could she respond? Maybe by saying “No, Internet!”?
- Ever had the experience of giving out your phone number over the phone and hearing it back from your interlocutor in a way you didn’t even recognize your own number anymore? “My number? That’s six two nine three nine oh four.” – “OK, I’ve jotted that down, that’s sixty-two, ninety-three, nine fourteen?” – “Hang on… let me think… err, yeah I think that’s it.”
- How do you say the expiration date of your credit card? If it was “12/12”, would you say “twelve twelve”, or “December twelve”, or “December two thousand twelve”, or …
No ouch this time. This is just to demonstrate that there are numerous ways to speak dates, and using the same pattern as the caller when repeating their input back can improve the intelligibility of the system, which means less frustration, which fuels acceptance of the overall application, which increases revenue…?
So I say you can have the computer say “internet connection” instead of “DSL”, and “six two nine three nine oh four” (not even “six two nine three nine zero four”), and even “twelve twelve” if that’s what the caller is inclined to say (maybe hastening to add a “that’s December, two-thousand and twelve”, just to confirm you have fully understood your caller). You will say: “How?” Let me explain.
VoiceObjects allows you to store the pronunciation of an utterance in a variable, along with its actual value. This is done through the grammar that enables the speech recognizer to understand the caller in the first place. This value is called the pronunciation value. There is no fixed format for this value; it is completely up to you. Handing this value back to VoiceObjects Server from within a grammar is simple: you append it to the return value of the slot that is filled by the corresponding utterance, separated by a double pipe (“||”).
When the server detects this “||” separator (which is configurable through our media platform driver concept, by the way), it parses out the actual value (“DSL”, say, as this might be the internal value required for further processing) and assigns it to the variable, then parses out the pronunciation value and assigns it as well. By the way, if you’re interested in this value during processing, you can retrieve it via the PRONUNCIATION(RefID) function provided by the Expression object.
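This parsing happens inside VoiceObjects Server, so you never write it yourself – but the idea is easy to sketch. Here is a minimal Python sketch of the splitting logic (the function name and the fallback behavior for values without a separator are my own assumptions for illustration, not the VoiceObjects API):

```python
def parse_slot_value(raw, separator="||"):
    """Split a grammar slot return value into (actual_value, pronunciation_value).

    If no separator is present, assume the pronunciation value simply
    defaults to the actual value, so grammars that don't use the
    feature keep working unchanged.
    """
    if separator in raw:
        actual, pronunciation = raw.split(separator, 1)
    else:
        actual = pronunciation = raw
    return actual, pronunciation

# The caller said "I have problems with my Internet connection";
# the grammar returned the internal value plus a pronunciation hint.
value, pron = parse_slot_value("DSL||internet_connection")
# value -> "DSL"                 (used for further processing)
# pron  -> "internet_connection" (used when speaking back to the caller)
```

The separator is split only once, so a pronunciation value is free to contain whatever characters your formatting algorithm expects.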
What you do with this pronunciation value is straightforward, too: you hand it over to a formatting algorithm (via our Formatting Bus), which takes the pronunciation value (along with the “real” value, which is actually not needed for speaking the variable value back) and uses it to produce the pronunciation when repeating the value in an output. Note how the grammar, in the above example, returns “internet_connection” as the pronunciation value; this assumes that there is a prerecorded prompt saying “internet connection” for the problem category. Your formatting algorithm would thus probably need to return “internet_connection.wav” as the audio file to use for playback. In fact, for this example you don’t even need your own formatting algorithm: the predefined formatting types utilize the pronunciation value instead of the actual variable value anyway. So choosing, e.g., TTA – Files or TTA – Complete as the formatting type for your Variable object will make the platform use “internet_connection.wav” right away. Nice and simple.
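If you did roll your own formatting algorithm, its core job would be tiny. Here is a hedged Python sketch of the mapping step (the function name and the token-to-wav convention are illustrative assumptions; the real algorithm lives on the Formatting Bus and would follow your prompt-naming scheme):

```python
def format_audio(pronunciation):
    """Sketch of a TTA-style formatting step: turn a pronunciation value
    into the ordered list of prerecorded audio files to play back.
    Each whitespace-separated token is assumed to map to one prompt file.
    """
    return [token + ".wav" for token in pronunciation.split()]

format_audio("internet_connection")
# -> ["internet_connection.wav"]
```

The same one-liner already covers multi-token pronunciation values, which is exactly what the number example below needs.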
Let’s have a look at the number pattern example now.
First, your grammar must be built in such a way that it can recognize single digits as well as number blocks. Usually, rules that match “one” up to “ninety-nine” suffice. The rest can be nested using smart grammar rule structures. In the tags that compute the value of what was said (as opposed to the words used), you need to add logic that also builds up the pronunciation value as the caller speaks (or rather: as the ASR engine computes the result). As an example, if the caller in fact says “sixty-two ninety-three nine oh four”, the slot return value computed by your grammar rules might be “6293904||62 93 9 oh 4”, which gets parsed as “6293904” for the actual variable value and “62 93 9 oh 4” for the pronunciation value. Your formatting algorithm might make a sequence of “62.wav 93.wav 9.wav oh.wav 4.wav” out of this. In fact, you could just as well use a predefined TTA algorithm for this again, e.g. TTA – Words, and it will do the job.
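The tag logic described above would normally live inside the grammar itself (in its semantic interpretation tags), but the accumulation it performs can be sketched in a few lines of Python. Everything here is an illustrative assumption about one way to build the slot return value, not VoiceObjects code:

```python
def build_slot_return(spoken_blocks):
    """Sketch of what the grammar's tag logic accumulates as the caller
    speaks number blocks. spoken_blocks is the sequence of recognized
    tokens, e.g. ["62", "93", "9", "oh", "4"].

    Returns the combined slot value "digits||pronunciation".
    """
    digit_map = {"oh": "0"}  # callers say "oh" for zero
    actual = "".join(digit_map.get(block, block) for block in spoken_blocks)
    pronunciation = " ".join(spoken_blocks)
    return actual + "||" + pronunciation

build_slot_return(["62", "93", "9", "oh", "4"])
# -> "6293904||62 93 9 oh 4"
```

The actual value is normalized to plain digits for further processing, while the pronunciation value preserves exactly the blocks the caller used, so the readback mirrors the caller’s own grouping.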
Last but not least, our famous sample application Prime Telecom, a telco self-service portal coming in three channels (voice, text, mobile Web), provides a sample implementation of Adapt-to-me with the credit card expiration date example I described above. Go check it out today! You can get all the software required to run this sample application for free at http://developers.voiceobjects.com. Go and impress your boss with what VoiceObjects can do to make your phone applications a much more pleasant experience, and your customers much happier. (Or maybe you ARE the boss? But hey – this mission is too important for me to allow you to jeopardize it…)
Oh, and if your boss tells you to implement this within your existing VoiceObjects app, check out the Input object documentation of our Object Reference (search for “pronunciation value”).