How to Write a Grammar

Overview

When recognizing speech using the GetSpeech PerCL command, you need to specify a grammar to use. A grammar tells the speech recognizer what to listen for, and what to return to the application when speech is recognized. There are two primary ways to specify a grammar: using built-in grammars, or writing your own.

FreeClimb comes with a limited set of built-in grammars. These are useful if you are trying to recognize digits or answers to yes/no questions. See GetSpeech for the list of available built-in grammars.

Typically, you will need to recognize something outside of what is provided by the built-in grammars, and will need to write your own. This how-to guide will describe how to write two fairly simple grammars, covering some key concepts that apply to all grammars.

FreeClimb uses a speech recognizer that supports the SRGS (Speech Recognition Grammar Specification) XML grammar standard. These grammars also need to adhere to the SISR (Semantic Interpretation for Speech Recognition) standard, which specifies how semantic information is returned from a grammar.


Grammar example: Simple list

The first grammar shows how to present the user with a simple list of choices. It tells the speech recognizer to recognize a limited set of colors. The grammar is listed first, followed by a description of each element it uses.

<grammar xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions" xml:lang="EN-US" tag-format="semantics-ms/1.0" version="1.0" root="Colors" mode="voice" xmlns="http://www.w3.org/2001/06/grammar" sapi:alphabet="x-microsoft-ups">
  <rule id="Colors" scope="public">
    <one-of>
      <item>
        <item repeat="0-1">light</item>
        <item>red</item>
        <tag>$._value = "red"</tag>
      </item>
      <item>
        <one-of>
          <item>light blue</item>
          <item>blue</item>
        </one-of>
        <tag>$._value = "blue"</tag>
      </item>
      <item><tag>$._value = "green"</tag>green</item>
    </one-of>
  </rule>
</grammar>

The <grammar> element provides some necessary attributes about the grammar. For the most part, you should just copy the header from this example. The only attribute you will likely need to change is the root attribute, which should be set to the ID of the default (main) rule of the grammar. A grammar file can contain multiple rules. When you recognize speech using the GetSpeech PerCL command, you can specify the rule name in addition to the URL of the grammar file. If no rule is specified, the default rule of the grammar (as specified in root) is used.

Each <rule> element has an id attribute, which is where you specify the ID of the rule. Grammar rules can reference other grammar rules, and for any reasonably complex grammar this is typically done. This lets you write complex grammars in a more modular fashion, and allows reuse of grammar rules across grammars.
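To illustrate how the root attribute relates to rule IDs, here is a minimal sketch of a grammar file with two rules. The header is copied from the example above; the Sizes rule and its phrases are hypothetical, and the semantic tags are left out to keep the focus on the rule structure.

```xml
<grammar xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions" xml:lang="EN-US" tag-format="semantics-ms/1.0" version="1.0" root="Colors" mode="voice" xmlns="http://www.w3.org/2001/06/grammar" sapi:alphabet="x-microsoft-ups">
  <!-- Default rule: its id matches the root attribute in the header -->
  <rule id="Colors" scope="public">
    <one-of>
      <item>red</item>
      <item>blue</item>
    </one-of>
  </rule>
  <!-- Hypothetical second rule: not the default, so it is only used
       when named explicitly or referenced from another rule -->
  <rule id="Sizes" scope="public">
    <one-of>
      <item>small</item>
      <item>large</item>
    </one-of>
  </rule>
</grammar>
```

With a file like this, a GetSpeech command that names no rule would use Colors, while one that names Sizes would listen for the size phrases instead.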

Underneath the <rule> element is the <one-of> element. This element defines a set of alternatives, any one of which can be recognized. In this sample we recognize three different colors: red, blue and green. There is an <item> element for each of them. To make things a bit more educational, we allow the caller to say either ‘light red’ or ‘red’ for the red option to match, and either ‘light blue’ or ‘blue’ for the blue option to match. We do not allow this for the green alternative. For the red and blue options, we show two different ways to make ‘light’ optional.

For the red option, there are two <item> elements under the top-level <item>. The first has a repeat attribute with a value of 0-1. This means the caller can say ‘light’ either 0 or 1 times and the grammar can still match. The second does not have this attribute, which implies that whatever is under the item must be spoken to match, so if the caller only spoke ‘light’, there would be no match. All sub-items must match for the parent item to match.

For the blue option, we make ‘light’ optional by adding another <one-of> element under the item. Under that are the two options (<item> elements) the caller can speak: ‘light blue’ and ‘blue’.

For the green option, we just have the one top-level <item> that specifies green.
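As a sketch of how the same idea could be extended, the fragment below applies the repeat="0-1" technique from the red option to green. It is an illustrative variation rather than part of the sample grammar; it would take the place of the existing green <item> inside the <one-of>.

```xml
<!-- Hypothetical replacement for the green <item> in the Colors rule -->
<item>
  <!-- "light" may be spoken zero times or one time -->
  <item repeat="0-1">light</item>
  <!-- no repeat attribute, so "green" must be spoken -->
  <item>green</item>
  <tag>$._value = "green"</tag>
</item>
```

With this change, both ‘light green’ and ‘green’ would return ‘green’, mirroring the behavior of the red option.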

For all options you will notice there is a <tag> element. The <tag> element tells the recognizer what information is returned to the application. It specifies the semantic information returned from the grammar. The content allowed in the <tag> element follows the SISR standard mentioned above, using the tag format declared in the grammar header (semantics-ms/1.0 in this sample). It is really a chunk of script code, and it can get quite complex, but you will usually only need to write simple assignments such as the ones in this sample grammar. Here, we set the output of the grammar (really the output of the root rule), which is represented by $._value, to a string that holds the name of the color. There are many other ways to return information from the grammar using the <tag> element, and even ways that don’t require a <tag> element, but this is a straightforward way to do it.

The semantic information you specify is returned to your application in the recognitionResult property of the callback request sent to your application when speech is recognized (that is, when the GetSpeech command returns with a reason of recognized). So if our sample grammar were used in a GetSpeech command and the grammar matched (recognition was successful), the recognitionResult would contain ‘red’ if the caller said ‘light red’ or ‘red’, ‘blue’ if the caller said ‘light blue’ or ‘blue’, and ‘green’ if the caller said ‘green’.

User Phrase | Semantic Information
“light blue” | “blue”
“blue” | “blue”
“green” | “green”
“light red” | “red”

Grammar example: Using multiple rules

In the second grammar we use a common grammar feature of a rule referencing another rule. This helps with grammar modularity, and also makes it easier to specify repeated occurrences of a phrase. In the grammar below, we are going to collect a list of toppings for a pizza. We will do this using one grammar with two rules. The main rule will reference the topping rule. Let’s assume this grammar was used in a GetSpeech command that used the following prompt when asking the caller for pizza toppings: “Tell us what toppings you want on your pizza. You can select from the following: pepperoni, tomato, onion and sausage.”

Here’s the grammar:

<grammar xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions" xml:lang="EN-US" tag-format="semantics-ms/1.0" version="1.0" root="Toppings"  mode="voice" xmlns="http://www.w3.org/2001/06/grammar" sapi:alphabet="x-microsoft-ups">
  <rule id="Toppings" scope="public">
    <item repeat="0-1">I want</item>
    <item repeat="0-1">I would like</item>
    <tag>$._value = "";</tag>
    <item repeat="1-4">
      <ruleref uri="#Topping"/>
      <tag>$._value = $._value + $$._value;</tag>
    </item>
  </rule>
  <rule id="Topping" scope="public">
    <item>
      <item repeat="0-1">and</item>
      <one-of>
        <item><tag>$._value = "pepperoni "</tag> pepperoni </item>
        <item><tag>$._value = "tomato "</tag> tomato </item>
        <item><tag>$._value = "tomato "</tag> tomatoes </item>
        <item><tag>$._value = "onion "</tag> onion </item>
        <item><tag>$._value = "onion "</tag> onions </item>
        <item><tag>$._value = "sausage "</tag> sausage </item>
        <item>
          <item>all</item>
          <item repeat="0-1">of them</item>
          <tag>$._value = "all "</tag>
        </item>
      </one-of>
    </item>
  </rule>
</grammar>

Our main grammar rule, Toppings, first allows the user to optionally say ‘I want’ or ‘I would like’. Then it lets the user say one to four toppings by referencing the Topping rule. Referencing another rule is easily done using the <ruleref> element as shown. You just need to set the uri attribute to point to the referenced rule. The referenced rule can be in the same file, in which case you put a # in front of the rule ID, or it can be in an external grammar, in which case you use a regular web URL (e.g. http://website.com/grammar.grxml#rulename - or leave off the rulename if it is the default rule). The repeat attribute on the enclosing <item> specifies how many times the caller can say a topping.
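For comparison, here is a sketch of the same main rule referencing a Topping rule that lives in a separate grammar file. The URL is the placeholder from the paragraph above, and the fragment assumes that file defines a Topping rule with scope="public", since SRGS only lets public rules be referenced from other grammars.

```xml
<rule id="Toppings" scope="public">
  <tag>$._value = "";</tag>
  <item repeat="1-4">
    <!-- External reference: grammar file URL, then # and the rule ID.
         The URL is a placeholder, not a real grammar. -->
    <ruleref uri="http://website.com/grammar.grxml#Topping"/>
    <tag>$._value = $._value + $$._value;</tag>
  </item>
</rule>
```

Only the uri value differs from the same-file version; the surrounding <item> and <tag> elements are unchanged.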

What is most interesting about this grammar is how we specify what gets returned to the application. Remember that returning information is done using the <tag> element. You can see two of these in the main rule. In the first one, we set the output ($._value) to an empty string. We need to initialize it because we are going to repeatedly append to it the result of each match of the Topping rule. The appending is done in the second <tag> element, where we add $$._value to our output. $$._value holds the output of the most recently matched rule reference. Every time the Topping rule matches, control returns here and we append what it returned. Notice that the values returned from the Topping rule end with a space, so topping choices are separated by a space.

The Topping rule is a simple rule that lets the user select one of our four toppings. The first thing to notice is that we optionally recognize an ‘and’, since our user will likely say something like ‘tomato and onion’ and we want to make sure our grammar matches that. The second is that we let the user say ‘all’ (or ‘all of them’).

When building a grammar, you need to think about how users will answer your prompts. Word your prompts to direct users toward what you are listening for, and make sure your grammars handle most of the ways users are likely to respond. If you have the time and resources, it is always good to go through grammar tuning and testing: get real users to respond to your speech application, collect all their responses, and update your grammar so that a higher percentage of user responses produce a match.
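As a small illustration of what a tuning pass might produce, the sketch below widens the optional lead-in of the Toppings rule. The phrases ‘I'd like’ and ‘give me’ are hypothetical additions, standing in for whatever real callers turn out to say.

```xml
<rule id="Toppings" scope="public">
  <!-- Optional lead-in: the caller may say at most one of these phrases.
       "I'd like" and "give me" are hypothetical additions from tuning. -->
  <item repeat="0-1">
    <one-of>
      <item>I want</item>
      <item>I would like</item>
      <item>I'd like</item>
      <item>give me</item>
    </one-of>
  </item>
  <tag>$._value = "";</tag>
  <item repeat="1-4">
    <ruleref uri="#Topping"/>
    <tag>$._value = $._value + $$._value;</tag>
  </item>
</rule>
```

Grouping the lead-in phrases in a single optional <one-of> keeps the list easy to extend. It also means only one lead-in can be matched, whereas two separate optional items in sequence would accept both phrases back to back.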

The application writer needs to understand the format and possible values that a grammar can return in order to handle matches appropriately. Here are some example phrases that this grammar matches, and the corresponding semantic information returned (in the recognitionResult property of the GetSpeech callback).

User Phrase | Semantic Information
“I would like pepperoni” | “pepperoni”
“I want pepperoni and onions” | “pepperoni onion”
“Tomatoes, onions and sausage” | “tomato onion sausage”
“Tomatoes all of them” | “tomato all”

First, notice how the returned string is built up by concatenating the individual results returned from the Topping rule. Also notice that the last example doesn’t really make sense, but it could happen, and the application writer needs to handle it appropriately. The application could ask the user to clarify, or assume they meant all toppings and follow up with a confirmation prompt and recognition: “I think you want all the toppings. Is that correct?”


Additional resources

Grammars can get very complex and can be a very important part of an application that uses speech recognition. Many resources are available online to help you write good grammars. The links below are a good starting place:

Microsoft Basic Grammar Writing
Microsoft Constructing Grammars
Microsoft SRGS Grammar XML
W3 SRGS Specification
W3 SISR Specification