This chapter addresses general methodology and tips for designing a successful Voice User Interface (VUI), with an emphasis on best-practice design issues that exploit certain features of VoiceObjects. These features make it easier and more efficient to implement good design and integrate audio.
Although this chapter is not intended to be a comprehensive treatment of VUI design guidelines and techniques, several general principles of good design emerge in the course of discussing specific topics and specific objects provided by VoiceObjects.
While there are many important elements in a phone application, the Voice User Interface (the portion of the system that callers hear and speak to) has potentially the greatest impact on its ultimate success or failure. Callers will only use phone applications that both meet their needs and are easy to use. The process of creating successful Voice User Interfaces can be broken down into seven phases.
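1. Requirements analysis
2. Definition of the hear & feel
3. Dialog design
4. Prompt design
5. Recording and voice coaching
6. Audio design
7. Usability testing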
In VUI design, as in any software project, requirements derive from four sources:

On the business side, VUI projects are truly multi-disciplinary – requiring participation from a number of stakeholders including Marketing, Business, IT and Customer Service. Each entity within a business will have its own goals and requirements, all of which must be clearly defined and communicated.
Truly user-centered dialog design can only be realized if caller demographics, behavior patterns and goals are thoroughly analyzed. For the design of any product to be successful – whether it's for an IVR or a ballpoint pen – you have to understand who's going to use it and how it's going to be used.
The tasks to be performed must be understood, including how individual tasks relate to each other.
All of the above are, in turn, subject to constraints of technology, time, resources, etc.
Lastly, any conflicts between these four requirement sources must be resolved.
The hear & feel of an application encompasses not only the persona of the voice saying the prompts, but also the verbiage itself, any other audio, even the very logic of the dialog flow. For this reason, it is important to define the hear & feel early in the project.
The more obvious aspects of the hear & feel include:
· Single-persona or multi-persona approach
· Male or female voice
· Persona character definition
· Style of speech (word choice, level of formality, etc.)
· Use of earcons and music, i.e. non-speech audio.
Less obvious, but still important, aspects include strategies for event handling, making help available, handovers to agents as well as feedback on and confirmation of recognition results.
The VUI provides a business with an opportunity to leverage current investments in branding and corporate image, and to further differentiate itself from competitors. The hear & feel should reinforce the brand.
The dialog design models the interaction flow and responses to possible events. It’s an application’s presentation of functionality. Dialog design follows from logical and linguistic criteria, as well as from callers' mental models.
In a banking application for example, there's an unavoidable logic in collecting a bank account number before providing an account balance. It makes sense to offer the more specific term mortgages before the more general term loans in a menu if these options route callers to different departments (otherwise mortgage customers who barge-in as soon as they hear loans will be misrouted).
When asked for a car model, say, when getting an automobile insurance quote, different callers will respond in different ways, depending on factors like how they think about car makes and models or how interested and informed they are about cars. They might respond with just the make (e.g. Toyota) or just the model (e.g. Corolla) or both. A well-designed system will accommodate all these responses gracefully; that is, with appropriate follow-up questions that avoid making callers feel like they've made a mistake.
The prompt design phase involves composing all the verbiage both for eliciting input from the caller and for providing system status and content information.
Stylistically, prompts should be clear, to the point, and not too wordy. They should be conversational – reflecting spoken, not written language. They should be written so that the resulting logical dialog will flow naturally. And finally, they should be consistent with the defined hear & feel. The reasons for this go far beyond cosmetics. A conversational style leverages the most natural way for people to communicate. The goal is to make the dialog as intuitive for the caller as possible, thereby alleviating the need for extensive – and wordy – explanations and instructions.
More about prompt writing guidelines can be found in the section on Collecting Caller Input.
Analogous to a GUI (Graphical User Interface), which uses color, layout, images, fonts, etc. to draw the user's focus visually, a Voice User Interface draws the user's focus audibly. The professional voice talent who records the application prompts is directed to do this by pausing at just the right time, slowing down or speeding up, raising or lowering the pitch of his or her voice, or emphasizing certain words; all of which are referred to as prosody.
Prompt recording also demands that the voice talent expresses the appropriate emotions. The same application can employ prompts which are meant to instruct, reassure, engage, invite, or encourage – to name only a few.
During the recording session, the voice coach provides the context and emotional intent of each and every prompt. This is critical because in a typical recording session, the prompts are presented on the recording script in isolation, outside the context of the dialog, and possibly out of sequence. Most of the time, the voice talents do not have the benefit of a comprehensive overview of the application and its function, for which the prompts are being recorded. The voice coach is also responsible for keeping the voice talent in character and consistent with the pre-defined hear & feel.
Audio design is concerned with the creation of non-speech audio. This includes branded jingles, landmarkers, earcons and background music - see section on non-speech audio. Care must be taken that any audio is suitable for the telephone bandwidth. And of course, the audio must be consistent with the corporate brand and the application hear & feel, and fit within a consistent sound idiom.
Usability testing provides a low risk preview of the performance and caller acceptance of your VUI design. Commonly addressed issues are the validation of assumptions made about callers with respect to their behavior, knowledge, preferences, etc.; the appropriateness of the hear & feel; and the grammar coverage of caller input. In addition, there may be open questions about the VUI design or other usability goals that require scrutiny.
Usability testing addresses the quality of the caller’s experience with the system. In usability testing, real callers, i.e. test participants who are representative of the targeted calling population, interact with a realistic version of the application to complete typical tasks. This produces data on how successful callers are in using the application, as well as data on their reactions to and opinions of the system.
More than one iteration of testing and redesign is recommended, since one usability issue may overshadow or distract callers from others.
Certain application behavior for event handling, dialog navigation and tuning can be defined at the Module object level. That behavior, unless overridden locally, will be inherited by all the objects specified within that Module object. Inheritance brings with it not only the efficiency of having to specify module-wide behavior only once, but also ensures that the specified behaviors will be consistent from object to object. This consistency will foster predictability, which will make it easier for callers to become comfortable using and co-operating with the application.
For a detailed description of event handling, navigation, and tuning refer to the respective sections in the Object Reference.
For a detailed description of the Module object see the Object Reference.
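The inheritance rule described above can be pictured as a lookup that starts at the current object and falls back to the enclosing module. The following Python sketch is only an illustration of that idea. The class `DialogObject`, the setting names and the prompt texts are hypothetical and are not part of the VoiceObjects API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogObject:
    """Hypothetical stand-in for a module or any object placed inside it."""
    name: str
    parent: Optional["DialogObject"] = None
    # Behaviors defined directly on this object (illustrative keys only).
    local_settings: Dict[str, str] = field(default_factory=dict)

    def resolve(self, key: str) -> Optional[str]:
        """Return the local value if one is set, else the inherited one."""
        node: Optional["DialogObject"] = self
        while node is not None:
            if key in node.local_settings:
                return node.local_settings[key]
            node = node.parent
        return None


# Module-wide behavior is specified once...
module = DialogObject("BankingModule", local_settings={
    "no_match_prompt": "Sorry, I didn't get that. Please say it again.",
    "repeat_command": "repeat",
})
# ...and applies to every object that does not override it.
balance_menu = DialogObject("BalanceMenu", parent=module)
account_input = DialogObject("AccountNumberInput", parent=module, local_settings={
    "no_match_prompt": "Please say your account number one digit at a time.",
})
```

In this sketch `balance_menu` picks up both module settings unchanged, while `account_input` overrides only the No Match prompt and still inherits the repeat command.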
Standard navigation provides functionality for callers to move back one step in the dialog, to move forward from a backward step, to have the last output repeated, or to skip the current object and move immediately on to the next one in the sequence.
The actual commands which control standard navigation need to be specified in the grammar for each navigation behavior, and any commands advertised to the caller should be consistent with the overall hear & feel of the application. For example, synonyms for “repeat” might range from “Say that again, please” to a casual “What?”
The command activation output (e.g. Activation – Backward) provides valuable feedback to the caller as to what command was recognized. As well as offering a smooth transition between prompts, it alleviates any confusion caused by a potential misrecognition. The caller knows what the system understood and why it's behaving the way it is.
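To make the four standard navigation behaviors and their activation outputs concrete, here is a minimal Python sketch. It assumes a simple linear sequence of outputs. The class `StandardNavigator`, the command names and the activation texts are invented for illustration and do not reflect how VoiceObjects implements navigation internally.

```python
from typing import List, Tuple


class StandardNavigator:
    """Toy model of backward, forward, repeat and skip over a fixed sequence."""

    # Hypothetical command activation outputs, one per navigation command.
    ACTIVATION = {
        "backward": "Going back.",
        "forward": "Moving forward.",
        "repeat": "Here it is again.",
        "skip": "Skipping ahead.",
    }

    def __init__(self, outputs: List[str]) -> None:
        self.outputs = outputs
        self.position = 0
        self.back_steps = 0  # backward steps that "forward" may still undo

    def handle(self, command: str) -> Tuple[str, str]:
        """Apply a command and return (activation output, output to play next)."""
        if command not in self.ACTIVATION:
            raise ValueError(f"unknown navigation command: {command}")
        if command == "backward" and self.position > 0:
            self.position -= 1
            self.back_steps += 1
        elif command == "forward" and self.back_steps > 0:
            self.position += 1
            self.back_steps -= 1
        elif command == "skip" and self.position < len(self.outputs) - 1:
            self.position += 1
            self.back_steps = max(0, self.back_steps - 1)
        # "repeat", or a command that cannot move, replays the current output.
        return self.ACTIVATION[command], self.outputs[self.position]
```

Returning the activation output together with the next output mirrors the point made above: the caller first hears which command was understood and only then the resulting prompt.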
Hyperlink commands give callers additional control over the traditional directed dialog. When defined at the module level, hyperlinks create global commands that allow callers to jump to specified places in the dialog – along the analogy of Web-based hyperlinks – or to customize their experience, for example, by using a command to activate alternate persona or functionality layers.
Typical examples for using custom navigation are general hyperlinks to the main menu, pausing the application, transferring to a customer service representative, or ending the call.
Careful consideration should be made as to which commands should be confirmed and when. Especially in cases where the cost of misrecognition is high, as with a Cancel command that could erroneously abort an interaction such as a booking process, the inclusion of a confirmation step is usually recommended for the application as a whole. But a state of the dialog flow that is rather noncritical, for example the start of a process, might feature a different, local behavior for the Cancel command: not performing a confirmation step if the recognition confidence were high enough.
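The combination of an application-wide confirmation rule with a local exception can be sketched as follows. This is an assumption-laden illustration: the `ConfirmPolicy` class, the policy names and the threshold of 0.8 are hypothetical values chosen for the example, not VoiceObjects settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfirmPolicy:
    """When to confirm a recognized command (all values are illustrative)."""
    always_confirm: bool = False
    # Confirm only if the recognition confidence falls below this value.
    confidence_threshold: float = 0.0


# Application-wide default for Cancel: a misrecognition is costly, so always confirm.
GLOBAL_CANCEL_POLICY = ConfirmPolicy(always_confirm=True)

# Local override for a noncritical state, such as the start of a booking process.
START_OF_BOOKING_POLICY = ConfirmPolicy(confidence_threshold=0.8)


def needs_confirmation(confidence: float,
                       local_policy: Optional[ConfirmPolicy] = None) -> bool:
    """Decide whether a recognized Cancel command should be confirmed."""
    policy = GLOBAL_CANCEL_POLICY if local_policy is None else local_policy
    if policy.always_confirm:
        return True
    return confidence < policy.confidence_threshold
```

With these example values, a Cancel recognized at 0.95 confidence is still confirmed in the middle of a booking, but is accepted without confirmation at the start of the process.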
The ability to pause the application is useful to callers who need time to complete some extraneous task like looking up an account number or grabbing pen and paper, or who find themselves in a situation where their full attention is momentarily demanded elsewhere, like in a moving car. Callers who use a pause command in one state of the dialog flow may assume that this functionality is available everywhere, so it makes sense to define this custom navigation at the global level, for consistency. When processing a Pause object, the regular dialog flow will be suspended until the configurable wake-up command is recognized from the caller (e.g. “Continue” or “I'm ready”). As with any navigation behavior, feedback to the caller – in the form of the Pause object's welcome message, looped pause message, final pause message and return message – is important to keep the caller informed about what the application is doing and listening for.
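The interplay of the four pause messages and the wake-up command might look like the Python sketch below. The order in which the messages are played, the `max_loops` limit and the function `run_pause` are assumptions made for illustration. The Object Reference remains the authority on the actual behavior of the Pause object.

```python
from typing import Iterable, List

WAKE_UP_COMMANDS = {"continue", "i'm ready"}  # illustrative wake-up phrases


def run_pause(caller_turns: Iterable[str], max_loops: int = 3) -> List[str]:
    """Return the sequence of messages played while the dialog is suspended.

    caller_turns yields what was recognized after each message was played.
    An empty string stands for silence.
    """
    played = ["welcome message"]
    for loops, utterance in enumerate(caller_turns, start=1):
        if utterance.strip().lower() in WAKE_UP_COMMANDS:
            # Wake-up command recognized: resume the regular dialog flow.
            played.append("return message")
            return played
        if loops >= max_loops:
            break
        played.append("looped pause message")
    # No wake-up command arrived in time (assumed use of the final message).
    played.append("final pause message")
    return played
```

Each branch plays a message, so the caller is never left in silence without knowing what the application is waiting for.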
Tip: Not all applications provide 24x7 or even any access to live agents, yet callers may still attempt to reach a CSR (Customer Service Representative) by pressing 0 or with commands like “Operator” or “I want to talk to a person”. Custom navigation behavior should always be specified for such requests so that, at the very least, they can be handled gracefully.
Example: “I understand you're trying to reach an agent. Since it's after hours...”
Event handling specifies how to proceed within a dialog in the case of any exceptions. Examples of such exceptions are internal errors, hang ups, the caller responding with something that is not recognized, or the caller not responding at all. Inheritance allows you to define event handling behavior at the module level and to have that behavior apply by default in all the subordinate objects. The design decision of whether to define event handling behavior globally or locally rests on whether the application needs to react differently depending on the current state of the dialog flow. For example, the first occurrence of an ASR No Match event might warrant the generic behavior of simply prompting the caller to repeat his or her input, just as someone might do in a conversation with another person.
| Speaker | Utterance |
|---|---|
| Frank | Do you have the time? |
| Bob | Yes, it's six twenty. |
| Frank | Pardon? |
| Bob | Six twenty. |
| Frank | Thanks. |
On the other hand, if someone did not receive a satisfactory response after the generic request to repeat, he or she would most likely rephrase the original question, perhaps with more detail.
| Speaker | Utterance |
|---|---|
| Ray | What’s the model number? |
| Carrie | L X four nine one. |
| Ray | Pardon? |
| Carrie | L X four nine one. |
| Ray | No, the model number should be an eight-digit number. |
| Carrie | Oh. I see it. Seven three zero eight three zero zero four. |
| Ray | Thanks. |
This last example shows the equivalent of the first and second occurrences of an ASR No Match event. After the second No Match, Ray provides additional details relating specifically to model numbers, phrased as an implied question, in order to elicit the desired response from Carrie. In a Voice User Interface design, such context-dependent behavior would be defined locally, within the Input object that collects the model number.
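The difference between generic, module-level handling and escalating, locally defined handling can be sketched in a few lines of Python. The object name `ModelNumberInput`, the dictionaries and the function `no_match_prompt` are hypothetical. The prompt texts are borrowed from the two conversations above.

```python
from typing import Dict, List

# Module-level default: the same generic reprompt for every occurrence.
GENERIC_NO_MATCH: List[str] = ["Pardon?"]

# Local definition inside the (hypothetical) Input object for the model number:
# the second occurrence adds detail about what a valid answer looks like.
LOCAL_NO_MATCH: Dict[str, List[str]] = {
    "ModelNumberInput": [
        "Pardon?",
        "The model number should be an eight-digit number.",
    ],
}


def no_match_prompt(object_name: str, occurrence: int) -> str:
    """Pick the reprompt for the n-th consecutive No Match (1-based)."""
    if occurrence < 1:
        raise ValueError("occurrence is 1-based")
    prompts = LOCAL_NO_MATCH.get(object_name, GENERIC_NO_MATCH)
    # Past the last defined level, keep playing the most detailed prompt.
    index = min(occurrence, len(prompts)) - 1
    return prompts[index]
```

Objects without a local definition fall back to the generic reprompt, which corresponds to the inherited module-level behavior.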
For some more information on event handling see also the section on Event Handling Strategies below.
For a detailed description of how to configure event handling definitions see the Object Reference.
For further explanation on event handling best practice see Chapter 8 – Advanced Event Handling.
Various objects are available for collecting input from the caller. The choice of which type of object to use depends on the nature of the information being collected as well as how the information will be used to determine the dialog flow. Some prototypical uses of these objects are discussed here.
For further information see also Chapter 3 – How to Interact with the Caller.
Menu objects are used when the dialog flow needs to branch for different handling based on a selection the caller makes. Depending on the nature of the options, especially how familiar they are to the caller, they may or may not be listed individually in the initial output, as shown in the following examples.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Would you like to pay by check, credit card or ATM debit card? |

| Object - Caller | Dialog Flow |
|---|---|
| Object | How would you like to pay? |
When the options are presented to the caller, most menus contain anywhere from two to five options. Although it ultimately depends on how cognitively demanding the options are, any more than four or five menu options risk overburdening callers' abilities to remember and mentally process what their choices are and how they are supposed to respond.
For a detailed description of the Menu object see the Object Reference.
The Input object is used generally to request information from the caller, and then to assign that information to one or more Variable objects.
When it comes to collecting information like a telephone number, a person's smoking preference or a type of cuisine, it's natural to ask for one piece of information at a time because that's what callers will tend to provide in response. In grammar terms, this would be single-slot input.
Other types of information – for example, the year, make and model of a car, or the date and time of an appointment – naturally go hand-in-hand for many people when they're providing that information. If you ask for one part, you may get more than one in return. If you ask for more than one part, you may at first get only one back. In this second case, we talk about multi-slot input.
| Object - Caller | Dialog Flow |
|---|---|
| Object | What's the year, make and model of the car you want to insure? |
| Caller | I drive a Ford Taurus. |
| Object | [Only two of the three slots are filled, so ask a follow-up question] And the year? |
| Caller | Two thousand three. |
| Object | The premium for a two thousand three Ford Taurus is... |

| Object - Caller | Dialog Flow |
|---|---|
| Object | Right now we're booking appointments for Wednesday and Thursday. Which day would you prefer? |
| Caller | Wednesday. Around three o'clock. |
| Object | [Two slots were filled, not just the one asked for. Skip question asking for the time and continue with confirmation.] I can schedule you for Wednesday at three thirty. Is that good for you? |
Intelligent, user-centered designs are flexible enough to accommodate both scenarios, and do so without making the caller feel like – or even aware that – he or she didn't answer the question as asked.
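Both scenarios come down to the same logic: accept whatever slots the caller supplied, then ask only for what is still missing. The Python sketch below illustrates this for the car insurance example. The slot names, the follow-up wording for make and model, and the helper functions are assumptions. In a real application the recognized slots would come from the grammar.

```python
from typing import Dict, Optional

SLOT_ORDER = ["year", "make", "model"]
FOLLOW_UP = {
    "year": "And the year?",
    "make": "And the make?",
    "model": "And the model?",
}


def merge_slots(filled: Dict[str, str], recognized: Dict[str, str]) -> Dict[str, str]:
    """Add whatever slots the caller supplied, whether asked for or not."""
    merged = dict(filled)
    merged.update({k: v for k, v in recognized.items() if k in SLOT_ORDER})
    return merged


def next_prompt(filled: Dict[str, str]) -> Optional[str]:
    """Return the follow-up question for the first empty slot, or None when done."""
    for slot in SLOT_ORDER:
        if slot not in filled:
            return FOLLOW_UP[slot]
    return None
```

For the answer "I drive a Ford Taurus", `merge_slots` fills make and model, and `next_prompt` then returns only the question about the year. If the caller had supplied all three parts at once, no follow-up question would be asked at all.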
For a detailed description of the Input object see the Object Reference.
The Confirmation object is used to present information that was collected in the preceding dialog, giving the caller the opportunity to accept or reject it and, if necessary, to correct it. The confirmation step can be used to confirm individual caller inputs immediately after each is collected, or to confirm a summarized block of items at the end of a logically grouped input series.
What to confirm depends on the nature of the information. As a general guideline, you want to confirm an item when the cost of making an error is high, as it would be for something like a callback telephone number or a Cancel command which could prematurely end a process of interaction. When to confirm can depend on where the caller is in the overall dialog flow. If proceeding with the dialog would put the caller in a position that would be difficult to recover from (if the information were incorrect, or the caller changed his or her mind), then a confirmation step is in order.
In the first example below, the lack of confirmation puts the caller in an awkward situation.
| Object - Caller | Dialog Flow |
|---|---|
| Object | What is your marital status? |
| Caller | Unmarried. [misrecognized as married] |
| Object | And what is your spouse's age? |
On the other hand, the next example shows a confirmation step that summarizes the details of a transaction for caller acceptance, before a bank transaction is executed.
| Object - Caller | Dialog Flow |
|---|---|
| Object | So, you want to transfer three hundred dollars from your savings to your checking account. Is that right? |
| Caller | Yes. |
| Object | Alright. Just a moment. |
Another time to confirm is when there's insufficient recognition confidence in matching the caller's utterance to the grammar. In other words, the recognizer is not "sure" what the caller said.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Would you like 'billing', 'sales' or 'technical support'? |
| Caller | [cough] Sales. |
| Object | [insufficient confidence, so confirm] Did you say "technical support"? |
| Caller | No. |
Even in this case, though, a confirmation step is not always the best strategy. Considering again the cost of making an error, it may take less time to follow through with the error than to ask the additional question. In the case below where a command is recognized but not with high confidence, it would have been quicker to have accepted the recognition hypothesis and to have simply gone ahead with repeating the reference number.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Your reference number is five seven seven four eight. |
| Caller | Repeat. [medium confidence] |
| Object | Did you say you wanted that repeated? |
| Caller | Yes. |
| Object | Once again, that's five seven seven four eight. |
In the following case of a misrecognition, it may take less time to carry out the misrecognized command.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Your reference number is five seven seven four eight. |
| Caller | Sweet! [recognized as Repeat] |
| Object | Once again, that's five seven seven four eight. |

| Object - Caller | Dialog Flow |
|---|---|
| Object | Your reference number is five seven seven four eight. |
| Caller | Sweet! [recognized as Repeat] |
| Object | Did you say you wanted that repeated? |
| Caller | No! |
The above examples are offered not to categorically discourage the designer from integrating confirmation into the dialog flow logic, but to raise an awareness of the trade-off between convenience and accuracy. Confirmation will always be an important part of good VUI design.
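One way to reason about this trade-off is to compare the time a confirmation step always costs with the time an unconfirmed error is expected to cost. The Python sketch below is a deliberately simplified model, not a VoiceObjects feature. The function name and every number in the usage note are illustrative assumptions.

```python
def should_confirm(p_correct: float,
                   confirm_seconds: float,
                   error_seconds: float) -> bool:
    """Confirm only if the expected cost of an unconfirmed error exceeds
    the fixed cost of asking the confirmation question.

    p_correct       -- estimated probability that the recognition is right
    confirm_seconds -- time the extra question and answer take, paid every time
    error_seconds   -- time lost if a wrong result is acted upon
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    expected_error_cost = (1.0 - p_correct) * error_seconds
    return expected_error_cost > confirm_seconds
```

Suppose a confirmation exchange takes about 4 seconds. For a Repeat command recognized with an estimated 60% chance of being right, wrongly repeating a reference number also costs about 4 seconds, so the expected error cost is 0.4 × 4 = 1.6 seconds and confirmation does not pay off. For a Cancel command that would discard a two-minute booking dialog, even a 90% chance of being right leaves an expected error cost of 0.1 × 120 = 12 seconds, which justifies the confirmation step.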
Tip: Sensitive information such as a PIN requires a different confirmation strategy. A PIN must not be read back to the caller, just as GUIs mask PIN input. Instead, the PIN must be matched with a second piece of identification, for instance an invoice number, date of birth, or phone number.
Care must be taken so that the No Input and No Match prompts in a Confirmation object (or any yes/no context) maintain their naturalness. Too often, prompt designers sacrifice the best practice of keeping the application dialog conversational in their effort to guide the caller, as the example shows.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Did you say you wanted to transfer to an agent? |
| Caller | [No input] |
| Object | Sorry, I didn’t hear you. To transfer to an agent, say ‘yes’. To continue here, say ‘no’. |
The above example is less than optimal because, from a dialog perspective, the mapping between the user input (yes or no) and the resulting application behavior is an artificial one, in the same way that pressing a DTMF key is artificially mapped to a non-numeric selection (e.g. For departure information, press 2). Because the caller is asked to respond in an artificial and non-conversational way, errors leading to additional caller frustration are more likely to occur. A better prompt-writing style would be the following, which provides the explanatory verbiage the caller needs while still being conversational and natural.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Sorry, I didn't hear what you wanted to do. I can either transfer you to an agent, or continue helping you here. Would you like to speak with an agent, yes or no? |
For a detailed description of the Confirmation object see the Object Reference.
The List object allows callers to listen to and navigate through a list of items and perform various actions on an individual item, e.g. reading, deleting or forwarding a message in an e-mail reader, or selecting an item from a shopping list. Supplementing the full description of the List object in the Object Reference, some points specifically related to VUI design are discussed here.
A browsable list can be thought of as a table with one or more rows and one or more columns. Each of the cells of the table represents an item that can be selected by the caller, either actively or passively, and acted upon. The task of the designer is to make these cells accessible to the caller, enabling him or her to navigate between them.
The optional messages to be specified in the Message section are Welcome, Begin of List, End of List and No Data.
The Welcome prompt plays as the caller enters the list. It typically introduces the contents of the list or table and instructs the caller on how to navigate between the various elements.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Here you can get quotes and news from |
Since the instructions might tend to be rather wordy, depending on the complexity of the information presented and the navigation commands that are active, you might offer abbreviated variants when returning to the same dialog state.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Here's your stock portfolio again. For a reminder, say 'what are the commands?'. |
In this case the help prompt, instead of the welcome message, would remind the caller of the available navigation or selection commands.
The Begin of List and End of List prompts keep the caller oriented with regard to the boundaries of the list, which is particularly valuable in the purely auditory environment of an IVR. Prompts like these and the No Data prompt fall into the category of system status prompts that inform the caller about where he or she is in the dialog or explain why the application is behaving in a certain way. For example, the End of List prompt might look like the following.
| Object - Caller | Dialog Flow |
|---|---|
| Object | That was the last company in your portfolio. Shall I repeat? |
The List object easily accommodates designs with both relative navigation by means of generic, standardized commands like back, previous row, next item, etc.; and navigation using direct access commands which are specific to the row and/or column names, like the ones in the table below.
| Grammar | Name | Quote_today | Quote_ | News | Portfolio |
|---|---|---|---|---|---|
| ibm | IBM | 17.5 | 16.3 | None. | 56 |
| apple, apple computers | Apple | 23.8 | 21.4 | Apple will release their new operating system tomorrow. | 127 |
| dell | Dell | 56.1 | 58.0 | Dell’s newly introduced laptop line is flying off the shelves. | 83 |
Callers could navigate directly to a particular company row in the portfolio table by saying IBM, or to a particular column within the current row with the command news. See the List object in the Object Reference for the available standard commands. It is up to the designer to decide which of these commands to offer, since too many command choices may confuse callers and make the commands impossible to memorize.
Designers should also avoid confusion caused by different utterances activating similar commands which mainly differ by their context, like the standard navigation command backward defined in the Module object (see Global Behavior) as opposed to the list command previous. It is recommended to use the same utterances for standard or custom navigation commands and for the corresponding local navigation commands defined in the List object. Grammars defined in the List object have precedence over the standard navigation grammars. So, for example, when the same utterance can activate the standard navigation command backward as well as the list command previous, the list command is processed.
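The following Python sketch illustrates relative navigation, direct access and the precedence of list grammars over standard navigation, using the portfolio table above. It is a simplified model under stated assumptions. The class `ListCursor`, the relative commands and the return labels are hypothetical, only one column command is modeled, and the column with the truncated name is left out.

```python
from typing import Tuple

ROWS = ["IBM", "Apple", "Dell"]
COLUMNS = ["Name", "Quote_today", "News", "Portfolio"]

# Direct-access grammars: utterance -> row or column index.
ROW_GRAMMAR = {"ibm": 0, "apple": 1, "apple computers": 1, "dell": 2}
COLUMN_GRAMMAR = {"news": 2}
# Relative list commands; "back" deliberately doubles as the list command previous.
LIST_RELATIVE = {"next": 1, "previous": -1, "back": -1}
# Utterances that would otherwise trigger standard navigation.
STANDARD_NAVIGATION = {"back", "repeat"}


class ListCursor:
    """Tracks the caller's current cell in the browsable list."""

    def __init__(self) -> None:
        self.row = 0
        self.column = 0

    def current(self) -> Tuple[str, str]:
        return ROWS[self.row], COLUMNS[self.column]

    def handle(self, utterance: str) -> str:
        """Return which grammar handled the utterance."""
        word = utterance.strip().lower()
        # List grammars are checked first, so they take precedence.
        if word in ROW_GRAMMAR:
            self.row = ROW_GRAMMAR[word]
            return "list"
        if word in COLUMN_GRAMMAR:
            self.column = COLUMN_GRAMMAR[word]
            return "list"
        if word in LIST_RELATIVE:
            # Clamp at the list boundaries instead of wrapping around.
            self.row = min(max(self.row + LIST_RELATIVE[word], 0), len(ROWS) - 1)
            return "list"
        if word in STANDARD_NAVIGATION:
            return "standard"
        return "no match"
```

Because "back" appears in both grammars, the sketch processes it as the list command and moves to the previous row, which is the precedence described above. "Repeat" is not a list command here, so it falls through to standard navigation.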
With the selection commands, designers specify which commands callers can use to either add or remove items from the selection cart, or to perform some other action. Examples are to select an entry from a list in a shopping application or to delete an e-mail from your inbox in an e-mail reader application.
| Object - Caller | Dialog Flow |
|---|---|
| Caller | Next. |
| Object | A new message from Main Street Home Mortgage. |
| Caller | Delete. |
| Object | Message deleted. |
Note: Since the selection command will act upon the current List item, care should be taken in designing how the list is played, so that callers have enough time to react between List items. If the option Enable auto-advance to the next row is selected and the application has already moved on to the next item before the caller has spoken the command, then the wrong item will be selected. To avoid this, crucial events like deleting an item should be confirmed.
Although an intelligent, user-centered VUI design more than compensates for less-than-perfect ASR (Automated Speech Recognition) accuracy through its event handling, the designer can also boost recognition accuracy by the way Menu items are worded and recorded. Well-written grammars provide flexibility by containing synonyms and other variations for what callers can say, but prompts must be more economical. They can't present every possible response to the caller. An important rule of thumb in VUI design is to always make it clear to the caller how he or she is supposed to respond. To accomplish this, designers often provide keywords for the caller to use when an acceptable response may not be obvious. By not doing this, you risk having callers respond in a more free-form manner than can be practically anticipated and written into the grammar.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Bad: Welcome to Prime Travel! I can help you select from over five hundred travel destinations, get information about hotels in the destination of your choice, or book economy, deluxe or luxury rooms. What can I help you with today? |
| Caller | Uh...I want a hotel room for July ninth through the fourteenth in Albuquerque. That's for two people. |
You could certainly write a grammar that could handle this particular caller's response, but it would be impossible to account for all the variations that you would receive with such a vague prompt.
| Object - Caller | Dialog Flow |
|---|---|
| Object | Better: Welcome to Prime Travel, where you have access to over five hundred travel destinations! Would you like to 'pick a destination', 'get hotel information' or 'book a room'? |
| Caller | Book a room. |
In this second example, the quotation marks indicate that these phrases are billboarded, or spoken as keywords that the caller is expected (though not necessarily required, depending on the grammar) to repeat verbatim. This example should be seen not so much as restricting what the caller can say, but rather as simplifying things for the caller through supportive instruction.
When designing keywords, or any kind of menu verbiage, it will also help callers to suggest words and phrases that are easy to say. Expressions that are hard to pronounce, either because they are phonetically challenging like a tongue twister or because they contain unfamiliar words, will only cause callers to misspeak and recognition will be likely to fail.
Note also that pick a destination, get hotel information, and book a room sound distinct from one another. As a counterexample, consider a menu that contains sales and service as keywords. Both words take about the same time to say, both start with an s sound and they end with acoustically similar z and s sounds respectively. Recognition accuracy will be improved on these two keywords if they are expanded to become more distinct: sales desk and service department, for example. Keep in mind that Menu items should not only be distinct from each other but from hyperlink commands as well. Consider the confusability of Ford and forward, for example.
Within VoiceObjects any output played to a caller is represented by an Output object. Outputs consist of prompts (see below) and may contain text that is read out through a text-to-speech engine, video, sounds or prerecorded text files, possibly containing various dynamic content as well as Silence objects.
Application output consists of prompts. We define a prompt as a unit of information, which is played by the system as acoustic output (speech or other audio), in order to communicate with the caller. As such, a prompt can be one of three types: a request for input, a status output or an information output.
· Requests for input typically take the form of asking the caller to make a selection, say from a menu or in choosing which element of a summary might need correction; asking the caller to make a decision (e.g. yes or no); or asking the caller for a piece of information, like a telephone number, in what is also referred to as "form filling".
· Status output presents the current system status to the user. This can be feedback to the caller about errors, summaries of collected data for eventual confirmation, and phrases like Just a moment, which communicate to callers that they can expect a delay while the system processes data. Also in this category are landmarkers, which provide information about where callers find themselves within a dialog. These are often earcons, but announcements like You're back at the main menu are also common.
· Information output could either be system-initiated or caller-initiated. Examples of the first case are greetings (assuring callers that they've called the right number) and instructions and tips for using the system. Caller-initiated information means information that the caller requests: a help message or a weather report, for example.
Spoken output can consist of text-to-speech (TTS), recorded speech or some combination of both. TTS is sometimes suited for playing dynamic content, like e-mail messages. Recorded speech, on the other hand, i.e. prompts recorded in a studio with a professional voice talent and processed into individual audio files, offers the precision and full expressiveness of the nuances of human speech in communicating not only content but also personality and emotion. Only recorded prompts create an optimal hear & feel.
Other types of audio are commonly used in phone applications to supplement the speech output, all of which contribute to the overall hear & feel. Non-speech audio communicates information – often subconsciously – as callers learn to recognize sounds and make associations. It can be more succinct than speech, and if used appropriately will add a liveliness and depth to the design. We categorize non-speech audio in four ways for illustrative purposes, but overlap between the different categories certainly exists.
· Jingles are tied in with corporate branding. One typical use of a jingle is in the greeting. Callers will recognize the jingle from company marketing. They will know immediately that they've called the right place and will bring their expectations about the brand to the call.
· Landmarkers serve to orient callers as to their place in the dialog flow. Typical landmarking audio alerts callers that they have returned to a main menu or have successfully navigated to a subdialog. An example of this second case would be traffic sound effects as a segue into traffic information.
· Earcons draw attention to specific situations in the dialog flow. A rising tone may suggest that a message is popping up, and a falling tone the end of that message. Earcons are also used as a timing signal for the start of a recording, or as a stylistic variant to confirm valid and invalid caller identification.
· Ambiance refers to background music or sound effects that evoke or reinforce a mental image in the caller's mind. This might take the form of restaurant noises (low talking, glasses and silverware clinking) for a restaurant locator application, for example.
State-of-the-art Voice User Interface designs employ multiple prompt variants to make a dialog non-repetitive and conversational. Natural and dynamic interaction adds up to a strong hear & feel. Three of the ways this is accomplished are discussed here.
One very straightforward way is to have different prompt variants based on the input mode: voice or DTMF. This allows you to specify more context-sensitive responses, for example in a No Match situation.
| Object - Caller | Dialog Flow |
| Input mode = voice | Sorry, I didn't understand what you said. Please say your eight-digit account number again, one digit at a time. |
| Input mode = DTMF | Hmm, the number you entered wasn't a valid account number. Please enter your eight-digit account number again. |
Another common way of introducing variant prompts into your application is to have different prompts based on the occurrence level. When callers reach a certain state in the dialog for the first time, they need some introductory information. Any later pass through that state, however, should be quick and reduced to the essentials. The following main menu example illustrates this.
| Object - Caller | Dialog Flow |
| Occurrence = 1 | Main menu. For balances and account histories, say 'my account'. To move money between accounts, say 'transfer funds'. For all loan inquiries, say 'loans'. |
By the time the caller reaches the main menu again, the caller knows the options already. Simply offering the keywords is perfectly sufficient.
| Object - Caller | Dialog Flow |
| Occurrence >= 2 | Back at the main menu. What can I help you with now: my account, transfer funds, or loans? |
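The input-mode and occurrence examples above can be pictured as a lookup over a small set of variants. The following Python sketch is purely illustrative and is not VoiceObjects configuration: the `Variant` class, its field names and the `select_variant` function are assumptions made for this example, while the prompt wording is taken from the tables above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    """One prompt variant with the conditions under which it applies (hypothetical model)."""
    text: str
    input_mode: Optional[str] = None  # "voice", "dtmf", or None for any input mode
    min_occurrence: int = 1           # first occurrence level at which the variant applies


def select_variant(variants: List[Variant], input_mode: str, occurrence: int) -> str:
    """Return the text of the best-matching variant.

    A variant matches when its input mode is unset or equals the caller's
    mode, and its minimum occurrence level has been reached. Among the
    matches, the highest occurrence level wins; ties go to the variant
    that names an input mode explicitly.
    """
    matches = [
        v for v in variants
        if v.input_mode in (None, input_mode) and v.min_occurrence <= occurrence
    ]
    if not matches:
        raise LookupError("no prompt variant defined for this context")
    best = max(matches, key=lambda v: (v.min_occurrence, v.input_mode is not None))
    return best.text


MAIN_MENU = [
    Variant("Main menu. For balances and account histories, say 'my account'. "
            "To move money between accounts, say 'transfer funds'. "
            "For all loan inquiries, say 'loans'.", min_occurrence=1),
    Variant("Back at the main menu. What can I help you with now: "
            "my account, transfer funds, or loans?", min_occurrence=2),
]

ACCOUNT_NO_MATCH = [
    Variant("Sorry, I didn't understand what you said. Please say your "
            "eight-digit account number again, one digit at a time.",
            input_mode="voice"),
    Variant("Hmm, the number you entered wasn't a valid account number. "
            "Please enter your eight-digit account number again.",
            input_mode="dtmf"),
]
```

With these definitions, `select_variant(MAIN_MENU, "voice", 1)` yields the full introductory menu, and any later visit yields the short keyword version.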
A third way to introduce prompt variation is through the use of random prompts. That means that when it comes time for the application to play a given prompt, there are multiple versions of that prompt for the system to select from. Each time the application needs to play that particular prompt, a different variant will be selected until all have been exhausted. Randomized variants make the most sense for prompts that get repeated during a single call, or for all the prompts if people will be calling an application often.
Creating prompt variants does not necessarily mean different verbiage (unless you're using TTS throughout the application). You can also create variants by coaching your voice talent to alter the delivery (i.e. reading) of the prompts, and this will prevent your application from sounding robotic and repetitive. Suboptimal wordings should never be used just for the sake of variety, and so different readings can help to fill out the set of variants.
Here are some examples of variants for the second occurrence of a No Match event.
| Object / Variant | Dialog Flow |
| 1 | Sorry, I didn't understand. |
| 2 | Excuse me. I didn't understand. |
| 3 | I'm sorry. I didn't get that. |
| 4 | I wasn't able to understand you just now. |
| 5 | Sorry, but I didn't quite get that. |
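The selection rule described for random prompts (a different variant each time until all have been used) can be sketched as a "shuffle bag". This is a minimal Python illustration, not the platform's actual implementation: the class name `RandomPrompt` and its interface are assumptions. The extra check that avoids an immediate repeat when a new cycle starts is a design choice of this sketch, not something the text specifies.

```python
import random
from typing import List, Optional


class RandomPrompt:
    """Plays variants in random order; none repeats until all have been played."""

    def __init__(self, variants: List[str], rng: Optional[random.Random] = None):
        if not variants:
            raise ValueError("at least one variant is required")
        self._variants = list(variants)
        self._rng = rng or random.Random()
        self._remaining: List[str] = []
        self._last: Optional[str] = None

    def next(self) -> str:
        if not self._remaining:
            # All variants exhausted: refill the bag and reshuffle.
            self._remaining = self._variants[:]
            self._rng.shuffle(self._remaining)
            # Assumption of this sketch: do not start a new cycle with the
            # variant that was played last in the previous cycle.
            if len(self._remaining) > 1 and self._remaining[-1] == self._last:
                self._remaining[0], self._remaining[-1] = (
                    self._remaining[-1], self._remaining[0])
        self._last = self._remaining.pop()
        return self._last


NO_MATCH_2_VARIANTS = [
    "Sorry, I didn't understand.",
    "Excuse me. I didn't understand.",
    "I'm sorry. I didn't get that.",
    "I wasn't able to understand you just now.",
    "Sorry, but I didn't quite get that.",
]
```

Five consecutive calls to `next()` on `RandomPrompt(NO_MATCH_2_VARIANTS)` play each of the five variants exactly once, in an unpredictable order.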
Variants based on input mode, occurrence and randomization can all be used in combination with one another to create a rich, yet easily managed inventory of prompts.
As discussed in the Event Handling paragraph within Global Behavior above, the decision to define event handling globally or locally from a design standpoint rests on whether the resulting application behavior needs to be specific to the context or not. As a representative example for VUI issues in event handling, the case of ASR – No Match events provides different context-independent and context-dependent behaviors, depending on the occurrence level. Similar strategies can apply to other standard events, although any design decision should ultimately be based on the type of application, the particular needs and usage patterns of the callers, etc.
In the case of the first occurrence of an ASR - No Match event, all that may be required to get callers back on track is to give them another chance to respond. To this end, a generic prompt like "Sorry?" has the advantage of being very brief, and it mimics what most commonly happens in human-to-human conversation when one person wants the other person simply to repeat what he or she said. Such a strategy works well when callers knew how to answer but simply misspoke, had to think a moment and then spoke too late, were briefly distracted by a side conversation, or coughed while speaking, to name a few possibilities.
| Object - Caller | Dialog Flow |
| Object | For verification, please tell me your date of birth. |
| Caller | Sure. January sev-[ clears throat ] seventeenth, nineteen seventy-two. |
| Object | [ No Match 1: ] Sorry? |
| Caller | January seventeenth, nineteen seventy-two. |
| Object | Thanks. Just a moment. |
The most efficient way to implement the context-independent behavior for the above scenario is to define it at the root module level, i.e. globally. Since we're defining it once for the entire application, and yet we want to avoid the robotic monotony of playing the same prompt for every ASR – No Match first occurrence, we have the ideal situation for using random prompting. The prompts that are used should naturally be consistent with the application's hear & feel, but might range anywhere from "What?" or "Come again?" to "I beg your pardon?" The number of variants to include in your design should be influenced by the maximum number of times callers might activate this event on a single call. The VoiceObjects platform will play all the variants (as events warrant) before repeating any prompt.
On the second occurrence of an ASR – No Match event, callers have been given the opportunity to repeat their response, but the second utterance is still not recognized as being in the active grammar. This may indicate that the callers do not understand what is being asked for, or are unsure how to respond. Callers will likely benefit from additional, context-dependent instructions or explanations in this case.
| Object - Caller | Dialog Flow |
| Object | How many child tickets would you like? |
| Caller | Uh...I'm not sure. My daughter is four years old. |
| Object | [ No Match 1: ] Pardon? |
| Caller | My daughter is four. Does she need a ticket? |
| Object | [ No Match 2: ] Sorry, I still didn't get that. |
| Caller | Zero |
| Object | Alright. How many adult tickets then? |
The generic (context-independent) reprompting of the No Match 1 event in the above example does not help the caller to recover but the No Match 2 prompt does, with locally defined prompting that addresses a common reason for out-of-grammar responses at this input state. In this example, the caller failed to answer the question not because he didn't know how many children he had, but because he didn't know what the ticket policy was regarding age.
Note also the introduction of DTMF fallback into the No Match 2 prompt, i.e. "...or enter it on your telephone keypad." Although not the case in our example, the caller could have known perfectly well what answer to give and may have been saying a number. For whatever reason though, the system still may not have been able to match the caller's utterance to the grammar. Reminding the caller that DTMF input is an option will help the caller to recover if the No Match is resulting from the caller not saying the number clearly, the caller adding unanticipated extra verbiage to the number, or background noise disrupting recognition.
Although the caller in the above example would have been better served if the context-dependent reprompting had occurred after the first No Match event, the designer often has to consider the calling population as a whole. If a relatively high percentage of callers answer "Zero" to this question anyway, or are repeat callers already familiar with the age policy, then we don't want to burden these callers unnecessarily with long explanatory prompts. The greater number of callers (or maybe the most important ones) are better served by simply being given a chance to repeat their response.
On the other hand, if a significant number of callers were being confused by the original question, then you would want to override the global behavior and provide a context-dependent prompt after the first occurrence. (The initial prompt might also be rewritten to be more explanatory, along the lines of "How many tickets for children aged six through twelve would you like?")
On the third occurrence of the ASR – No Match event, three attempts to elicit input from the caller have failed, and so many applications transfer the caller to a call center agent if one is available. Other applications might politely end the call, perhaps suggesting that, if there's a lot of noise where they are, callers might call back when they're in a quieter environment.
The rationale here is that further reprompting increases the chance of valid input only slightly, whereas the risk of frustrating the caller with more No Matches is rather high. Since this behavior would be uniform across the application, it should also be defined globally. Of course, the maximum level of event handling depends on the context of the application. Three levels is a rule of thumb. For some applications a fourth level may be reasonable; others may require only two.
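The escalation strategy discussed in this section (generic global reprompt first, locally overridden context-dependent reprompt second, transfer or exit at the maximum level) can be summarized in a short Python sketch. Everything here is an illustrative assumption rather than VoiceObjects syntax: the table names, the state name `child_tickets`, the action labels, and the wording of the local and transfer prompts are invented for this example.

```python
from typing import Dict, Tuple

Action = Tuple[str, str]  # (action, prompt text)

MAX_LEVEL = 3  # rule of thumb from the text; some applications use two or four

# Global behavior, defined once for the whole application.
GLOBAL_NO_MATCH: Dict[int, Action] = {
    1: ("reprompt", "Sorry?"),
    2: ("reprompt", "Sorry, I still didn't get that."),
    3: ("transfer", "Let me connect you with someone who can help."),  # hypothetical wording
}

# Local overrides per input state; only the levels that need
# context-dependent behavior are listed.
LOCAL_NO_MATCH: Dict[str, Dict[int, Action]] = {
    "child_tickets": {
        2: ("reprompt",
            # Hypothetical wording, not the original prompt.
            "Sorry, I still didn't get that. Child tickets are for children "
            "aged six through twelve. Please say the number of child tickets, "
            "or enter it on your telephone keypad."),
    },
}


def handle_no_match(state: str, occurrence: int) -> Action:
    """Resolve the No Match behavior for an input state and occurrence level.

    A local definition for the level takes precedence; otherwise the
    global definition applies. Occurrences beyond MAX_LEVEL are treated
    like MAX_LEVEL.
    """
    level = min(occurrence, MAX_LEVEL)
    local = LOCAL_NO_MATCH.get(state, {})
    return local.get(level, GLOBAL_NO_MATCH[level])
```

Moving the `child_tickets` override from level 2 to level 1 would model the alternative design mentioned above, where a confusing question warrants context-dependent help after the first No Match.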
The following table lists other events that affect the VUI design, along with certain issues to consider when defining event handling. The list of design considerations cannot be expected to cover all eventualities, but rather is intended to illustrate some of the more common issues that designers encounter.
For more information on event handling strategies also refer to Chapter 8 – Advanced Event Handling.
| Event | Design Considerations |
| ASR Max Speech Duration | Either the caller is speaking too long or there's ambient noise, like a loud television in the background, being misinterpreted as caller input. On the first occurrence, you may want to play a prompt that makes it clear to the caller exactly what responses |
| ASR No Input | The caller may have been distracted from hearing the prompt or a false barge-in may have interrupted the prompt before the caller could hear it (and know how to respond). Or the caller may have answered, but too softly for the recognizer to detect. In this case, a brief reprompt on the first occurrence can be expected to get the caller back on track. E.g. "Sorry, what city?" The second occurrence may indicate that the caller does not know how to respond, either because of not knowing the answer or having become lost in the dialog flow. Here, you might reprompt with short explanations of the options or offer Help as a way of having the options (more thoroughly) explained. You may want to tell callers about pausing the application to allow them time, for example, to look up an account number. Additionally, you might remind callers of navigation commands like main menu or back up in case they may have gone down the wrong dialog flow path. |
| ASR No Input/Match | Used for cases when the behavior for No Input and No Match events should be identical. |
| ASR No Match | Discussed in detail above in section Example: ASR No Match event. |
| ASR No Match (DTMF) | Used when the event handling after DTMF input differs from that after voice input. The difference is most obviously expressed in the prompt verbiage, as in "I didn't understand what you entered." The DTMF input that the caller just gave is invalid, meaning the caller 'fat fingered' what would have been valid input, entered an invalid string of numbers, or entered a single number for which there is no valid menu or command mapping. Simply reprompting the caller should remedy the 'fat finger' case. Callers entering an invalid string may benefit from being told how many digits to enter or where to find the valid number (e.g. "You can find the eight-digit invoice number below the date in the upper right corner of your invoice."). In the case of an invalid single-digit entry, callers may be entering 0 in an application that does not offer agent transfers, or trying other DTMF commands that they may have learned from other applications or earlier versions of the current application. These attempts should be handled gracefully, acknowledging what the caller may be trying to do but informing him or her of the available options. For example, "I understand that you're trying to reach an operator, but..." etc. |
| ASR No Match (Voice) | Used when the event handling after voice input differs from that after DTMF input. The difference is most obviously expressed in the prompt verbiage, as in "I didn't understand what you said." |
| Caller Cancel | An important consideration here is that the Cancel command activates consistent behavior across the application. Will the application back up a step, return to the top of a submenu or return to the main menu? If returning to a high-level menu, might the caller become frustrated if several steps need to be redone? If so, you might give the caller a choice of where to continue from. Considering that the cost of a misrecognized Cancel might be high, you may want to always confirm this command. |
| Caller Exit | Commonly activated by the more standard synonym Goodbye, which needs to be included in the activation grammar. Confirmation strategy should consider whether the caller is at a natural end point in the dialog flow or not. An Exit command in such a place might be confirmed only if the recognition confidence were low, whereas the same command in the unlikely place of the middle of a transaction might be confirmed in all cases, so as to prevent the call from ever being ended mistakenly. |
| Caller Exit/Cancel | Used in cases where the global commands Exit and Cancel activate the same application behavior. |
| Caller Help | The global command Help activates the help event which typically plays a context-dependent prompt, defined locally. Help prompts provide more detail and explanation than other prompts, remind callers of other navigation commands, or present additional options to the caller. |
| Error Connector | Defined if a separate behavior for connector errors is desired. Typically, a prompt is played to inform callers that certain information or certain transactions are not available. Callers might be given the option to continue with other parts of the application or to be transferred, or asked to call back later. |
| Error Internal or Error Media Platform | Typically, a generic prompt is played to inform callers that the application is having some sort of technical difficulty. Callers might be transferred or asked to call back later. |
Layers are an integral part of the VoiceObjects platform. They allow designers to easily separate the core application logic from the way an application is presented to the caller.
Among the countless uses of layers is the ability they give to designers to easily and efficiently customize and personalize the behavior, functionality and hear & feel of an application. For instance, a menu can offer additional choices for gold customers depending on specific profile data available from a back-end system, or certain application features can be disabled outside of normal business hours.
Users can customize their call experience, for example, by selecting which language to speak, which persona to interact with or which input mode they prefer. The application can personalize itself for the identified caller with things like tailored menus, geographically relevant information or different modes for novice and expert users. Each of these examples can be implemented as manual or automatic layers.
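The idea of keeping the core logic fixed while layer states adjust what the caller is offered can be sketched conceptually in Python. This is not the VoiceObjects Layer object or its configuration: the layer names, the business hours (08:00 to 18:00), the `CallContext` fields and the extra "gold service" menu choice are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time
from typing import Dict, List

# Core application logic: the base menu, independent of any layer.
BASE_MENU = ["my account", "transfer funds", "loans"]


@dataclass
class CallContext:
    customer_tier: str  # e.g. "standard" or "gold", from a back-end profile
    local_time: time    # time of the call


def active_layer_states(ctx: CallContext) -> Dict[str, str]:
    """Derive the state of each (automatic) layer from the call context."""
    open_now = time(8, 0) <= ctx.local_time < time(18, 0)  # assumed business hours
    return {"tier": ctx.customer_tier, "hours": "open" if open_now else "closed"}


def menu_options(ctx: CallContext) -> List[str]:
    """Apply the active layer states on top of the base menu."""
    states = active_layer_states(ctx)
    options = list(BASE_MENU)
    if states["hours"] == "closed":
        options.remove("transfer funds")  # feature disabled outside business hours
    if states["tier"] == "gold":
        options.append("gold service")    # hypothetical extra choice for gold customers
    return options
```

A manual layer would differ only in where the state comes from: the caller's explicit choice (for example, a selected language) rather than a value computed from the call context.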
A detailed description of how to use layers, including VUI aspects, can be found in Chapter 7 – How to Use Layers.
For information on how to configure the Layer object see the Object Reference.