This chapter describes how to make best use of the layer architecture of VoiceObjects.
Layers are an integral part of VoiceObjects. They allow designers to easily separate the core application logic from the way an application is presented. Layers can influence aspects such as language or persona, but they can also change the application behavior itself. For instance, a menu can have additional choices for gold customers depending on specific profile data available to the application, or certain application features can be disabled outside of normal business hours. Defining different prompts for different phone channels is also done using the layer concept.
Layers attach to individual objects, the core components from which applications are built within VoiceObjects. They can be seen as orthogonal functionality to the dialog. Designers may stack any number of layers on top of each other when building their application. The individual layers are independent of each other, but can be freely combined to provide flexibility and dynamic behavior along various dimensions (see figure below).

Layers may capture any type of caller-specific or back-end-dependent information and influence the application accordingly. VoiceObjects offers a set of frequently requested functionality as system layers. Others will be implemented as custom layers. Examples for layers are:
· Language
Application language is a system layer provided by VoiceObjects. This enables applications to easily switch between different input and output languages without having to duplicate or modify the application logic itself.
· Input Mode
Input mode is another system layer. In the voice or video channel, use it to define different prompts depending on which input modes are allowed: voice or DTMF. In the text-based channels, the input mode is not relevant; by default, it is set to “voice”, while Infostore, the logging component of VoiceObjects Server, logs it as “text”.
· Channel
Channel is another system layer. It allows defining different prompts, grammars, or application behavior depending on the phone channel a service will be deployed on.
· Service levels
Service levels may vary depending on the time a call comes in, as well as on the specific caller. This may influence options such as the availability of human operators to transfer to, the ability to perform certain transactions on the back-end, etc.
· Caller preferences
Many modern applications allow callers to set their individual preferences regarding, for example, the persona they wish to talk to, which services they are most interested in, and whether they want more or less explanation. These preferences then influence the way the application interacts with the caller.
· Caller experience level
In addition to preferences set by the callers themselves, the application can also learn about the caller on its own. A typical use case is the distinction between novice callers, who require more detailed explanations of certain features, and experienced callers, who have used the application several times and already know their way around.
· Caller history
In much the same way that major Web sites such as Amazon utilize previous shopping behavior to present personalized recommendations to visitors, layers can be used to incorporate information derived from the caller’s interaction history into an application. This can range from the simple case of playing an advertisement for an investment opportunity when the caller’s account balance exceeds $5,000 to a complex re-ordering of menu entries based on an analysis provided by a CRM (Customer Relationship Management) system in the back-end.
When processing an object at call time, the server dynamically applies all relevant layers to it as depicted in the figure below, and generates the required dialog step.

Because the layers incorporate caller-specific information, their combined application to the underlying objects results in personalized experiences:
· Caller #1 calls in from his mobile phone. This activates a specific set of layers along with certain settings within those layers. This could mean e.g. that media platform settings are adjusted to potentially noisy environments, and that confirmations are sent out via SMS to the mobile phone.
· Caller #2 calls in from her home phone. This activates the same set of layers as for caller #1, but with completely different settings. This could mean that in this case the media platform settings are adjusted for a quiet environment and landline-quality speech. In addition, confirmations might be sent via e-mail.
· Caller #3 uses a multimedia-capable mobile phone. In addition to using somewhat different settings than those for caller #1, this also activates a larger set of layers, enabling the use of additional channels (e.g. pushing picture messages to the caller’s phone).
Since the server performs the complex task of dynamically combining the individual layers at call time, designers no longer need to consider manually all of the different combinations that may occur. Handling these combinations by hand raises three problems. First, as the number of individual layers and settings within each layer rises, the number of combinations grows exponentially and quickly becomes unwieldy. Second, all layer combinations must work correctly and reliably regardless of whether the system serves one caller or one thousand parallel dialogs. Third, performance can quickly become an issue if the combination of layers is not implemented efficiently. When using the VoiceObjects layer architecture, all of these points are addressed implicitly, and the designer can focus on the optimal application design instead of having to worry about implementation details.
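To get a feel for how quickly the combinations add up, the following Python sketch multiplies the number of states per layer. The layer names and state counts are invented for illustration, and the sketch assumes that each layer contributes exactly one active state.

```python
from math import prod

# Hypothetical layers and the number of states each one can take.
layer_states = {
    "Language": 3,
    "Persona": 2,
    "Channel": 4,
    "Service level": 3,
    "Time of day": 3,
}


def combination_count(states_per_layer):
    """Number of distinct layer combinations, one active state per layer."""
    return prod(states_per_layer.values())


print(combination_count(layer_states))
```

Adding a sixth layer with only two states would already double the result, which is why enumerating the combinations by hand does not scale.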
The use of layers also significantly enhances the reusability of application blocks. Instead of building a separate copy when a slightly different presentation or behavior is required, a new layer can simply be added that retains the core of the existing functionality while adapting the things that need to be changed. This speeds up development and leverages the tested quality of existing modules, which in turn are based on the solid foundation of the objects themselves.
VoiceObjects provides two different kinds of layers, so-called manual and automatic layers.
Manual layers need to be switched manually during call time, typically activated by some command or choice made by the caller. The switch is performed by using the system function LAYER(layer, [state]) within an Expression object, which allows you to both retrieve the current state of a specific layer (leaving argument state empty), and explicitly set it to a new state (providing argument state as the state to switch to). Manual layers can only be in one state at a time. The most frequent example of a manual layer is Persona. The number of voices you want to have in your application corresponds to the number of states of this layer, one of which would be declared as the default (=initial) state. By using a Hyperlink object with a corresponding Expression object as its destination, the layer can be switched to a different state at call time, so that all following prompts are spoken by the corresponding persona.
Automatic layers change their states automatically, according to the status of the application during call time. Examples would be Caller Profile depending on back-end data, or Daytime depending on the current time of day. Automatic layers, as opposed to manual layers, may have more than one active state. The state definition is typically done by analyzing the values of variables or expressions, which usually change during a dialog, e.g. due to back-end access or caller inputs.
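The difference between the two kinds of layers can be illustrated with a small Python model. This is only a conceptual sketch and not the VoiceObjects implementation: the class names, the method names, and the sample states are assumptions, and the layer() method merely imitates the get/set behavior described for the LAYER(layer, [state]) function.

```python
class ManualLayer:
    """Toy model of a manual layer: exactly one state is active at a time."""

    def __init__(self, name, states, default):
        if default not in states:
            raise ValueError("default must be one of the states")
        self.name = name
        self.states = list(states)
        self.state = default

    def layer(self, state=None):
        """Imitates the get/set behavior described for LAYER(layer, [state])."""
        if state is None:
            return self.state  # no argument: return the current state
        if state not in self.states:
            raise ValueError(f"unknown state: {state}")
        self.state = state  # argument given: switch to that state
        return self.state


class AutomaticLayer:
    """Toy model of an automatic layer: states follow from conditions."""

    def __init__(self, name, conditions):
        # conditions maps a state name to a predicate over the call context
        self.name = name
        self.conditions = conditions

    def active_states(self, context):
        # Unlike a manual layer, several states may be active at once.
        return [s for s, test in self.conditions.items() if test(context)]


persona = ManualLayer("Persona", ["Bob", "Alice"], default="Bob")
persona.layer("Alice")

profile = AutomaticLayer("Caller Profile", {
    "gold": lambda ctx: ctx["card"] == "gold",
    "high balance": lambda ctx: ctx["balance"] > 5000,
})
print(persona.layer(), profile.active_states({"card": "gold", "balance": 7200}))
```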
Within VoiceObjects the complete layer functionality is bundled into the Layer object. Layers are built by defining a new Layer object for each desired layer. Within the editor for this Layer object you can either create a manual or an automatic layer, by simply defining the necessary properties like the number of layer states you want to have. All Layer objects and their respective states are accessible within the folder Layers in the Object Browser.
Within VoiceObjects it is possible to directly link a special layer condition (e.g. layer Daytime = afternoon) into the Layer field, which is part of many objects like the Output or Input object.
For detailed information on the Layer object see the Object Reference.
To give an overview of how to best use some of the benefits that layers offer, the following paragraphs describe how to build multi-lingual applications, multi-driver applications, multi-persona applications, applications with multiple service levels, and applications for multiple phone channels.
It should be kept in mind that the options shown here are examples, and that the various choices can, of course, be freely combined. Thus it is easily possible to build a multi-lingual application with two personas per language, offering three different service levels and automatic switching between business hours and weekends.
Many phone applications need to be provided in several languages at once, for example in English and Spanish. Most often the basic dialog flow remains the same for the different languages so that essentially only prompts and grammars need to be dynamically replaced. Prompts can consist of both pre-recorded audio and text-to-speech for dynamic content. In addition, multi-linguality can be combined with multi-persona design to address specific sub-groups within the appropriate language groups such as Spanish-speaking teenagers or English-speaking seniors.
When offering multi-lingual applications, there are various ways of switching between the different language versions. Two frequently used solutions are to either use different phone numbers for the different languages, or to have an initial question that asks for the caller’s preferred language.
VoiceObjects provides an easy mechanism both to obtain the currently active language and to set it. A function LANGUAGE([languageCode]) is provided within the dialog context functions. When used without an argument, it returns the code of the currently active language (e.g. en-US or de-DE). When called with an argument, it sets the active language. Thus, for example, LANGUAGE(fr-CA) switches to Canadian French. For more details about dialog context functions, refer to the Expression object in the Object Reference.
When using separate phone numbers for the different language versions, it is preferable to pass in the respective language code as part of the initial HTTP request from the media platform to the server. This can be done with the parameter vsLanguage. The request URL would then look something like this:
http://server:port/VoiceObjects/DialogMapping?VSN=application_name&vsLanguage=fr-CA
For more details, refer to Service URL configuration in Chapter 4 – Service Deployment in the Deployment Guide.
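If request URLs of this kind are generated by a script, for example when configuring several phone numbers, they can be assembled as in the following Python sketch. The host name, the port 8080, and the helper name build_service_url are placeholders chosen for illustration; only the path and the parameters VSN and vsLanguage come from the URL shown above.

```python
from urllib.parse import urlencode


def build_service_url(base, service_name, **params):
    """Assemble a DialogMapping request URL with optional extra parameters."""
    query = {"VSN": service_name, **params}
    return f"{base}/VoiceObjects/DialogMapping?{urlencode(query)}"


# "http://server:8080" stands in for the real server address.
url = build_service_url("http://server:8080", "application_name", vsLanguage="fr-CA")
print(url)
```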
When dealing with multi-lingual applications it needs to be kept in mind that in addition to the application itself, the media platform also needs to support multiple languages in terms of speech recognition as well as, optionally, text-to-speech. Not all media platforms are currently capable of dynamically switching between different languages during a call. For more details, refer to the documentation of your media platform.
Each Grammar item that is defined within VoiceObjects has a Language field to define which language layer the item belongs to. For more details on the Grammar editor, refer to the Grammar object in the Object Reference.

The language selected from the drop-down list indicates the language layer that the respective Grammar item belongs to. Within a given grammar definition, multiple individual Grammar items for different language layers may be defined. VoiceObjects then dynamically selects the appropriate grammar (or grammars) at call time depending on the active language layer.
Thus to support both US English and US Spanish in an application, define appropriate Grammar items for both language layers as shown in this example:

While this screenshot shows the use of embedded grammars, the same mechanism applies to file references or dynamically generated grammars.
For performance reasons, most phone applications in deployment make use of file-based grammars. In these cases, the switching between different languages can also be achieved via dynamic Resource Locator objects that point to the appropriate grammar files. To do this, grammar files with the same name should be provided in separate folders for each language. The grammar itself can then be defined once as shown in this example:

Note that the Language field needs to be set to Default in this case, as this single grammar definition serves all different languages. The switching between the different languages is managed via the Resource Locator object:

It defines the location from which the grammar is loaded by the media platform at call time. To achieve that, the URL field contains the dynamic component [E:Current language] (Expression object in this case, as indicated by the “E” within the square brackets) which indicates the currently active language, e.g. en-US or fr-CA.
The complete URL path for retrieving the grammar is dynamically built at call time and will result in a path like
http://myserver.mycompany.com/VoiceObjects/Resources/Samples/PrimeInsurance/Grammar/en-US/PrimeInsurance_CarMakeAndModel.grammar
By providing the appropriate grammar files in the respective folder locations, the correct one will automatically be selected when the language is switched during the call.
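The way such a path is assembled can be mimicked in a few lines of Python. This is a sketch of the idea only: the template string is a reconstruction based on the resulting path shown above, and the function expand_locator is a hypothetical helper, not part of VoiceObjects.

```python
import re

# Matches placeholders such as [E:Current language] or [L:Persona].
PLACEHOLDER = re.compile(r"\[([EL]):([^\]]+)\]")


def expand_locator(template, values):
    """Replace [E:name] and [L:name] placeholders with their current values."""
    def substitute(match):
        name = match.group(2)
        return values[name]  # raises KeyError if a component has no value
    return PLACEHOLDER.sub(substitute, template)


# Assumed URL field of the Resource Locator object.
locator = ("http://myserver.mycompany.com/VoiceObjects/Resources/Samples/"
           "PrimeInsurance/Grammar/[E:Current language]/")
grammar_file = "PrimeInsurance_CarMakeAndModel.grammar"

for language in ("en-US", "fr-CA"):
    print(expand_locator(locator, {"Current language": language}) + grammar_file)
```

The grammar definition itself never changes; only the value substituted for the placeholder differs between languages.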
Outputs in multi-lingual applications can be managed through the language layer provided in each Output item (similar to handling the grammars) and through the use of dynamic resource locators. The following examples show how to use each of these separately as well as in combination.
Each Output Item defined in an Output object has a Language field to define which language layer it belongs to.

Any number of individual Output items can be defined.

It is also possible to distinguish between different varieties of the same language:

The use of the language layer is the best way to build multi-lingual applications when using text-to-speech (TTS) only, or when the languages are so different syntactically that no common sentence patterns can be used. In many cases, however, the languages used in the application share basic sentence structures, and this can be exploited when concatenating audio files.
As described above for grammars, when dealing primarily with audio files it is preferable to switch between languages through dynamic Resource Locator objects. The same holds true for outputs, as it reduces the number of individual objects and items that need to be created.
Once again, when switching languages through dynamic Resource Locator objects only a single Output Item needs to be defined:

This Output Item references one or more Audio objects, which in turn define the audio files to be played, where to find them, and what their alternative TTS prompts are:

Switching between different languages is handled by redefining the resource locator:

It defines the location from which the audio is loaded by the media platform at call time. The URL field contains two dynamic components (Expression and Layer objects in this case, as indicated by the “E” or “L” within the square brackets):
· Current language
indicating the currently active language, e.g. en-US or fr-CA
· Persona
indicating the voice talent to be used, e.g. Bob or Alice
The complete URL path for retrieving the audio file is dynamically built at call time and will result in a path like
http://myserver.mycompany.com/VoiceObjects/Resources/Samples/PrimeInsurance/Audio/en-US/Bob/main_menu_first_claim.wav
By providing the appropriate audio files in the respective folder locations, the correct one will automatically be selected when the language is switched during the call.
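Because every combination of language and persona needs its own copy of each audio file, it can help to list the expected paths before recording. The Python sketch below does this for the folder layout used in the example path; the root folder name and the helper expected_audio_files are assumptions made for illustration.

```python
from itertools import product
from pathlib import PurePosixPath


def expected_audio_files(root, languages, personas, file_names):
    """List every path that must exist for all language/persona pairs."""
    return [
        str(PurePosixPath(root) / language / persona / name)
        for language, persona, name in product(languages, personas, file_names)
    ]


paths = expected_audio_files(
    "Audio", ["en-US", "fr-CA"], ["Bob", "Alice"], ["main_menu_first_claim.wav"]
)
for path in paths:
    print(path)
```

Comparing such a list against the files actually present on the resource server is a simple way to catch a missing recording before a caller does.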
When combining multiple layers, it can sometimes be convenient to use a hierarchical structure with embedded Output objects.
In the example shown here, a top-level Output object is used to branch on the states of a caller layer (Novice or Frequent).

The corresponding embedded Output objects then apply combinations of subsequent custom and system layers.


VoiceObjects enables you to deploy a single application across multiple media platforms simultaneously. This is of particular interest in cases of heterogeneous media platform environments. It can also be used as a migration path towards newer versions of a platform or markup code.
One of the great advantages of VoiceObjects is that applications built with VoiceObjects are independent of the media platform. Only when deploying an application through a service does the user have to select a specific media platform driver. This is typically done in the Service editor:

To provide even more flexibility, the driver selected in the Service object can be overridden dynamically by passing the appropriate value for the parameter vsDriver when initiating a service from the media platform itself with an HTTP request. For example, to request the use of the Voxeo Prophecy Hosting VoiceXML 2.1 driver, the request URL would look like this:
http://server:port/VoiceObjects/DialogMapping?VSN=service_name&vsDriver=10
For more details, refer to Service URL configuration in Chapter 4 – Service Deployment in the Deployment Guide.
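The override rule (a vsDriver value in the request takes precedence over the driver configured in the Service object) can be expressed as the following Python sketch. The function effective_driver, the port, and the service driver ID "3" are hypothetical; the sketch models the rule and does not reproduce the server's code.

```python
from urllib.parse import parse_qs, urlsplit


def effective_driver(service_driver, request_url):
    """Return the vsDriver value from the URL, else the service's driver."""
    query = parse_qs(urlsplit(request_url).query)
    override = query.get("vsDriver")
    return override[0] if override else service_driver


# "server:8080" is a placeholder address; "3" is a made-up default driver ID.
url = "http://server:8080/VoiceObjects/DialogMapping?VSN=service_name&vsDriver=10"
print(effective_driver("3", url))
```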
When building applications, there sometimes arises the need to include media platform-specific settings, e.g. to control the text-to-speech voice used, or to influence the behavior of the speech recognition engine. Within the VoiceObjects platform, these are called tuning parameters, and there is a dedicated section (the Tuning section) within the object editors to define them:

Tuning parameters can be standard parameters defined within VoiceXML in the voice and video channel, such as the confidence level for recognition, or custom parameters provided by a specific media platform.
Just like Grammar or Output objects, tuning parameters provide a language layer field. In addition, they also provide a driver layer field by which they can be attached to a specific media platform driver, and a channel layer field to activate a parameter only in a specific channel:

Note that it is possible to select either a specific version of a media platform together with the recognition engine used (e.g. Comverse Media Server 3.0.2 (Nuance)) or just a media platform type (e.g. Comverse Media Server).
As usual, multiple layer settings can be combined. The screenshot below shows how this is done using the Nuance Café platform to switch between an English female and male voice:

Depending on conditional expressions (Voice is female, Voice is male) the custom tuning parameter bevocal.voice.name is set to either jennifer (a female voice) or mark (a male voice). Since this tuning parameter is only available for Nuance Café (formerly BeVocal), the driver layer is set to Nuance Cafe. This ensures that when the application is run on a different media platform, these settings are ignored and the default voice is used throughout. Finally, the language layer in this example is set to English, so these settings are only applied if the active language is English. When the application runs in another language, e.g. German, these settings are not applied.
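The filtering described here can be modeled in Python as follows. The data structure and the function active_parameters are illustrative assumptions; only the parameter name, its two values, and the layer settings are taken from the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TuningParameter:
    name: str
    value: str
    language: Optional[str] = None  # None means "any language"
    driver: Optional[str] = None    # None means "any driver"
    condition: Callable[[dict], bool] = lambda ctx: True


def active_parameters(parameters, language, driver, context):
    """Keep only parameters whose layer fields match the current call."""
    result = {}
    for p in parameters:
        if p.language not in (None, language):
            continue
        if p.driver not in (None, driver):
            continue
        if p.condition(context):
            result[p.name] = p.value
    return result


params = [
    TuningParameter("bevocal.voice.name", "jennifer", "English", "Nuance Cafe",
                    lambda ctx: ctx["voice"] == "female"),
    TuningParameter("bevocal.voice.name", "mark", "English", "Nuance Cafe",
                    lambda ctx: ctx["voice"] == "male"),
]

print(active_parameters(params, "English", "Nuance Cafe", {"voice": "male"}))
print(active_parameters(params, "German", "Nuance Cafe", {"voice": "male"}))
```

In this model an empty result means that no platform-specific setting is sent, so the platform falls back to its default voice.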
For more information on tuning parameters refer to Tuning in the Object Reference.
Plug-ins provide the capability to integrate pre-existing VoiceXML-based applications into VoiceObjects applications running in the voice or video channel, as well as to make use of custom extensions provided by a specific media platform. Thus, when switching from one platform to the next a similar issue arises as with the tuning parameters discussed above.
Using plug-ins in multi-driver applications is best supported by a dynamic resource locator approach as explained above. The resource locator definition should include a Variable or Expression object indicating the media platform, and different Plug-In objects for the different media platforms need to be provided in an appropriate folder structure.
Even though significant consolidation towards the W3C SRGS-XML format is currently ongoing with regard to speech recognition grammar formats, not all media platforms currently support a joint grammar format. Depending on the platforms involved, it may therefore happen that when switching from one media platform to another, the format of the grammars used needs to be adapted.
This is best done following the approach of dynamic resource locators presented above. As for Output objects, there is a single definition of the grammar itself through a file reference:

Switching between the different grammar formats is done through the dynamically assembled resource locator:

It defines the location from which the grammar is loaded by the media platform at call time. The URL field contains three dynamic components (Expression objects in this case, as indicated by the “E” within the square brackets):
· Current language
indicating the currently active language, e.g. en-US or fr-CA
· Platform name
indicating the name of the media platform provider, e.g. Voxeo or Nortel
· ASR vendor name
indicating the name of the speech recognition engine provider, e.g. Nuance or IBM
The complete URL path for retrieving the grammar is built dynamically at call time and will result in a path like
http://myserver.mycompany.com/Samples/PrimeInsurance/Grammar/en-US/Voxeo/Nuance/PrimeInsurance_CarMakeAndModel.grammar
By providing the appropriate grammar files in the respective folder locations, the correct one will automatically be selected when the application is run on a different media platform.
As with grammars, there is growing convergence in the area of supported audio files. Most media platforms now support 8 kHz, 16-bit WAV files. In the area of video support, this convergence has yet to happen.
If it is necessary to switch between different audio or video file formats for different platforms, this is best done using dynamic resource locators as described in Handling outputs in multi-lingual applications. The resource locator definition should include a Variable or Expression object indicating the media platform, and different sets of audio and video files in the appropriate encodings need to be provided in an appropriate folder structure.
Current best practice in VUI design is to create personas that will engage callers in a natural and intuitive conversation. A persona of a voice application is expressed by the voice and use of language employed in an application. It forms the character or personality of the application, which, in turn, influences the caller's mental model of how to interact with it. Since it is common for phone applications to accommodate multiple demographic groups, the use of multiple personas is a way of connecting more closely with those individual groups.
Multi-persona applications can be created easily and efficiently using the mechanisms provided by VoiceObjects. This is especially true because switching between personas is not limited to different outputs but may also include different dialog flows (e.g. different confirmation strategies for caller input) or different ways of handling caller problems.
Switching outputs between different personas is structurally very similar to switching between outputs in different languages. Thus, the previously presented method using dynamic resource locators is also applicable here.
Different personas can also be set by modifying the settings in the Tuning section. This example from the paragraph on tuning parameters shows how to switch between a male and a female voice on Nuance Café:

Finally, there is the option of using a custom layer to switch between the different personas. This approach is preferable when using TTS, and when the personas have different textual output.

Switching personas may, of course, influence more than just the outputs. By placing the persona on a custom layer, it is easy, for example, to:
· Add or remove menu entries depending on the active persona
Each individual Menu item has a Layer field that can be used for this purpose.
· Use different event handling based on the active persona
Each event handling item provides a Layer field for this. A typical use case would be to have different personas for novice users and for expert users. The novice persona would provide multi-level help and guide the caller in case of problems, whereas the expert persona might just give brief messages to indicate a problem since the caller is assumed to know his way around.
· Use different grammars based on the active persona
When switching outputs between different personas, it is sometimes also necessary to adjust the grammars accordingly. For this purpose, each Grammar item provides a Layer field.
· Use different logging based on the active persona
Each Log item provides a Layer field to achieve this.
Layers are ideally suited to integrate multiple service levels into a single application, thus enabling a simple integrated maintenance process. The following three examples illustrate how to implement certain frequently occurring service level distinctions.
Many organizations want to modify their services based on the current date, day of week, or time. This can range from playing specific messages (Good morning vs. Good evening, Happy New Year, etc.) to switching the entire dialog flow. For instance, an application is only available during normal business hours, and presents either a standard announcement or the option to record a message outside these business hours.
The example shown here plays a time-dependent greeting for the caller (Good morning, Good afternoon, or Good evening). We start by determining the current time, using the NOW() function provided within the Expression object. For more details on its parameters, refer to the Expression object in the Object Reference.

Based on the current time, we can then build a new Layer object Time of day, which includes the three states morning, afternoon, and evening and their corresponding conditions.

Here are the definitions of the three different states morning, afternoon, and evening of the layer Time of day:



These layer states are then used as layer conditions within the output that plays the welcome message:

Of course, the Time of day layer can be used in other places, too, e.g. in Event Handling. If a caller encounters difficulties in the main menu, then after a certain number of retries we may want to transfer him or her to an agent for further assistance. This is of course only an option while agents are available, so in the evening we may want to revert to just playing a prompt. The following example shows how this is easily done by using layer conditions in event handling:

Two event handlings are defined for the same event (ASR No Input/Match with occurrence level >= 3), but with different layer conditions. The first one (Transfer to agent) is only activated if the layer Agent has the active state isAvailable. The other one (Terminate call) is activated if Agent does not have the active state isAvailable.
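The logic of this section can be condensed into the following Python sketch. It is a conceptual model only. The hour boundaries of the three states, the rule that agents are unavailable in the evening, and the "Reprompt" outcome for earlier occurrences are assumptions, since the actual conditions are defined in the Layer and event handling editors.

```python
from datetime import datetime


def time_of_day(now):
    """Map a timestamp to a layer state; the hour boundaries are assumed."""
    if now.hour < 12:
        return "morning"
    if now.hour < 18:
        return "afternoon"
    return "evening"


GREETINGS = {
    "morning": "Good morning",
    "afternoon": "Good afternoon",
    "evening": "Good evening",
}


def agent_is_available(now):
    # Assumption: agents are staffed in the morning and afternoon only.
    return time_of_day(now) != "evening"


def handle_repeated_no_match(now, occurrence):
    """Pick the event handling for the third and later no-input/no-match."""
    if occurrence < 3:
        return "Reprompt"  # hypothetical handling for earlier occurrences
    return "Transfer to agent" if agent_is_available(now) else "Terminate call"


call_time = datetime(2024, 5, 6, 19, 30)
print(GREETINGS[time_of_day(call_time)], "/", handle_repeated_no_match(call_time, 3))
```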
Sometimes it is desirable to offer a service under several different phone numbers targeted at different caller groups. The behavior of the application should then adapt to the individual number that was called.
This number can be retrieved via the Dialed Number Identification Service (DNIS). There are different ways of handling the mapping between DNIS and application:
DNIS mapping on the media platform
On most media platforms, a mapping can be defined between the DNIS and the application to be started. This mechanism can be used to initiate different services, as identified by their VSN (VoiceObjects Service Name), for different numbers.
A disadvantage of this approach is that some media platforms require a restart to pick up changes in the configuration.
DNIS mapping in the service
Within the Service editor in VoiceObjects Desktop, a comma-separated list of DNIS numbers can be provided (see also Chapter 2 – Configuring Servers and Services in the Deployment Guide).

The respective service is activated if the DNIS transmitted by the media platform matches one of the numbers in the list.
This approach has the advantage of requiring only a single configuration on the media platform, which directs all calls to the server together with the appropriate DNIS. The server then activates the appropriate service, and this mapping can easily be modified at any time without requiring a restart of either the media platform or the server.
DNIS mapping in the applications
The two methods described above dispatch different DNIS numbers to different services. In addition, it is of course possible to access the DNIS within an application itself and then to react accordingly, e.g. by enabling or disabling certain menu options.

In this approach, the media platform directs all calls to the same service, which is identified through the VSN instead of DNIS. Changes in the configuration are made by adapting the application itself and then redeploying the service. Since VoiceObjects provides hot redeployment (see Chapter 3 – Managing Servers and Services in the Deployment Guide), this never causes any downtime.
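For the second approach, DNIS mapping in the service, the matching rule can be sketched in Python as shown below. The service names and phone numbers are invented, and the helper functions are hypothetical; the sketch only illustrates how a comma-separated DNIS list per service leads to the selection of one service.

```python
def parse_dnis_list(text):
    """Split a comma-separated DNIS list into a set of numbers."""
    return {number.strip() for number in text.split(",") if number.strip()}


def find_service(services, dnis):
    """Return the first service whose DNIS list contains the dialed number."""
    for name, dnis_list in services.items():
        if dnis in parse_dnis_list(dnis_list):
            return name
    return None


# Hypothetical service names and numbers.
services = {
    "PrimeInsurance_Claims": "8005550100, 8005550101",
    "PrimeInsurance_Sales": "8005550199",
}

print(find_service(services, "8005550101"))
```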
Credit card companies typically offer different types of cards, associated with different levels of customer care. Holders of a green card receive the basic services such as checking their balance or reporting a stolen card. Gold card members have access to additional services, e.g. travel planning. And at the highest level, platinum cardholders receive personal service and are immediately transferred to a human agent. All of these distinctions depend on caller profiles that are accessed from the back-end system of the company, after the credit card number has been entered.
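Expressed as a Python sketch, the three service levels might be modeled as follows. The item labels and the function next_step are illustrative assumptions, and the card level is presumed to have already been read from the back-end profile.

```python
BASIC_ITEMS = ["Check balance", "Report stolen card"]

# Menu items paired with the card levels (layer states) that enable them.
LAYERED_ITEMS = [("Travel planning", {"gold"})]


def next_step(card_level):
    """Return the dialog step for a caller with the given card level."""
    if card_level == "platinum":
        return ("transfer", "human agent")
    extra = [item for item, levels in LAYERED_ITEMS if card_level in levels]
    return ("menu", BASIC_ITEMS + extra)


for level in ("green", "gold", "platinum"):
    print(level, next_step(level))
```

The basic menu is defined once; the layer condition only decides whether the additional item is offered.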
The following example shows how to structure a menu to distinguish between “green card” and “gold card” members. The dialog structure looks like this:

To implement this functionality, the Menu object uses a layer condition attached to the Travel Planning Menu item:

VoiceObjects supports building applications that can be run in any of the four supported phone channels: voice, video, text, or Web. Chapter 10 – How to Support Multiple Phone Channels gives an introduction to this topic.
When building one application for multiple phone channels, implementation typically differs in the prompts and grammars used, less so in the actual call flow.
Prompts played in voice or video applications are usually more verbose than prompts displayed in text or Web applications. In addition, they might reflect a persona design, i.e. using first person, colloquial language and different ways of addressing the caller. It is for these reasons that voice prompts cannot, or should not, be re-used for text and Web applications. When designing an application for multiple channels, different prompts have to be written; the prompts used for the text and Web channels are typically the same, though.
To separate the prompts for each channel, use the Channel layer field in Output items. While voice and video prompts typically reference Audio and Video objects, respectively, text and Web prompts are entered directly into the Output items. Multi-linguality or multi-persona design in voice or video applications is implemented inside the Audio and Video objects, e.g. using dynamic Resource Locator objects as described in Handling outputs in multi-lingual applications. In the text and Web channel, different Output items need to be defined to achieve the same result. If one application supports all four channels, it is best practice to define one Output item each for the voice and video channel, but only one Output item for the text and Web channel. This single Output item should then reference an external Output object in which the various text and Web prompts can be defined. This reduces the complexity inside the original object editor (e.g. of an Input or Menu object).
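The recommended arrangement (one Output item each for voice and video, plus one shared item for text and Web) can be pictured with the following Python sketch. The file names and the function select_output are made up for illustration and do not correspond to actual VoiceObjects objects.

```python
def select_output(items, channel):
    """Return the first Output item whose channel field covers the channel."""
    for channels, content in items:
        if channel in channels:
            return content
    raise LookupError(f"no output defined for channel {channel!r}")


# One item each for voice and video, one shared item for text and Web.
welcome_items = [
    ({"voice"}, "audio: welcome.wav"),
    ({"video"}, "video: welcome.3gp"),
    ({"text", "Web"}, "output object: Welcome (text and Web)"),
]

for channel in ("voice", "video", "text", "Web"):
    print(channel, "->", select_output(welcome_items, channel))
```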

The Output object could then look as follows:

The concept of grammars comes from the IVR world, where grammars are required to restrict the set of possible utterances a caller can make in a given dialog step in voice and video applications using ASR. Without grammars, free speech recognition over the telephone line would not be possible with today’s technology. While grammars can be used in Input objects to ask for information and map the response to an internal value (the slot value), grammars in places like custom or standard navigation or selection commands in the List object need to be matched to activate the underlying processing. In both cases, various different ways of saying the same thing need to be defined in order to have optimal speech recognition.
In a text or Web application, grammars can also be built to map caller input to slot values, so that the Input object can be used across all phone channels in the same manner. In this scenario, though, caller input is typically short and shows less variety, and in most cases the list of possible responses is so short that the options can also be directly displayed on the screen, so that the caller merely needs to select one. The underlying grammar is therefore much simpler in text and Web applications, as compared to speech grammars. Grammars in text and Web applications need to be written in the text-to-grammar (TTG) format.
In addition to the typical use cases of grammars in the voice and video channel, text and Web applications allow free caller input, as the text a caller responds with can directly be used as is in the application. For the latter use case, the question mark “?” must be used in the grammar definition to denote free text input.
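The contrast between a restricting grammar and free text input can be illustrated with a short Python sketch. It is a toy model: the four-digit grammar is written as a regular expression rather than in any real grammar format, and treating "?" as "accept any non-empty text" is an assumption about the intended behavior.

```python
import re

FREE_TEXT = "?"  # placeholder that denotes free text input


def matches(grammar, caller_input):
    """Check caller input against a toy grammar definition."""
    if grammar == FREE_TEXT:
        return bool(caller_input.strip())  # any non-empty text is accepted
    return re.fullmatch(grammar, caller_input) is not None


# Assumed per-channel definitions: four digits for voice and video
# (written here as a regular expression), free text for text and Web.
GRAMMARS = {
    "voice": r"\d{4}",
    "video": r"\d{4}",
    "text": FREE_TEXT,
    "Web": FREE_TEXT,
}

print(matches(GRAMMARS["voice"], "1234"), matches(GRAMMARS["text"], "my PIN is 1234"))
```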
Grammar definitions are therefore usually different among the channels. In the following examples, a built-in grammar is used for the voice and video channel (digits, with length=4), whereas free text input is signaled through the question mark “?” placeholder in the text and Web channel:
