Extending the Web to support multiple modes of interaction.
News
- 22 January, 2008: The Implementation Report Plan for the "EMMA: Extensible MultiModal Annotation markup language" Candidate Recommendation has been updated on the following point:
- The end of the Candidate Recommendation period is changed from "?? 2007" to "14 April 2008", as defined in the Candidate Recommendation.
- 11 December, 2007: EMMA: Extensible MultiModal Annotation markup language is a W3C Candidate Recommendation. The Candidate Recommendation period closes on 14 April, 2008. (Implementation Report Plan)
- 16-17 November, 2007: The Workshop on W3C's Multimodal Architecture and Interfaces was held at Keio University in Japan. The Call for Participation, the Agenda, the Minutes, and the Summary are available. The list of issues raised during the workshop is also available.
- 28 August 2007: The Workshop on W3C's Multimodal Architecture and Interfaces will be held at Keio University in Fujisawa, Japan on 16-17 November 2007. Position papers are due 5 October.
- 11 May 2007: Jerry Carter (Nuance), Rafah Hosn (IBM) and Kaz Ashimura (W3C) gave talks on "Multimodal Web to Expand Universal Access" at W3C Track in WWW2007 Conference, Banff, Canada.
- 9 April 2007: Second Last Call Working Draft for EMMA: Extensible MultiModal Annotation markup language
- 30 March 2007: The Multimodal Interaction Working Group is relaunched. The updated charter is available on the W3C Web server.
The Multimodal Interaction Activity seeks to extend the Web to allow users to dynamically select the most appropriate mode of interaction for their current needs, including any disabilities, whilst enabling developers to provide an effective user interface for whichever modes the user selects. Depending upon the device, users will be able to provide input via speech, handwriting, and keystrokes, with output presented via displays, pre-recorded and synthetic speech, audio, and tactile mechanisms such as mobile phone vibrators and Braille strips.
Multimodal interaction offers significant ease-of-use benefits over uni-modal interaction, for instance, when hands-free operation is needed, for mobile devices with limited keypads, and for controlling other devices when a traditional desktop computer is unavailable to host the application user interface. Adoption is being driven by advances in embedded and network-based speech processing, which are creating opportunities for integrated multimodal Web browsers and for solutions that separate the handling of visual and aural modalities, for example, by coupling a local XHTML user agent with a remote VoiceXML user agent.
The Multimodal Interaction Working Group (member only link) should be of interest to a range of organizations across different industry sectors.
The Multimodal Interaction Working Group was launched in 2002 following a joint workshop between the W3C and the WAP Forum. The Working Group's initial focus was on use cases and requirements. This led to the publication of the W3C Multimodal Interaction Framework, and in turn to work on extensible multi-modal annotations (EMMA), and InkML, an XML language for ink traces. The Working Group has also worked on integration of composite multimodal input; dynamic adaptation to device configurations, user preferences and environmental conditions (now transferred to the Device Independence Activity); modality component interfaces; and a study of current approaches to interaction management. The Working Group has now been re-chartered through 31 January 2007 under the terms of the W3C Patent Policy (5 February 2004 Version). To promote the widest adoption of Web standards, W3C seeks to issue Recommendations that can be implemented, according to this policy, on a Royalty-Free basis. The Working Group is chaired by Deborah Dahl. The W3C Team Contact is Kazuyuki Ashimura.
We are very interested in your comments and suggestions. If you have implemented multimodal interfaces, please share your experiences with us, as we are particularly interested in reports on implementations and their usability for both end-users and application developers. We welcome comments on any of our published documents. If you have a proposal for a multimodal authoring language, please let us know. To subscribe to the discussion list, send an email to www-multimodal-request@w3.org with the word subscribe in the subject header. Previous discussion can be found in the public archive. To unsubscribe, send an email to www-multimodal-request@w3.org with the word unsubscribe in the subject header.
If your organization is already a member of W3C, ask your W3C Advisory Committee Representative (member only link) to fill out the online registration form to confirm that your organization is prepared to commit the time and expense involved in participating in the group. You will be expected to attend all Working Group meetings (about 3 or 4 times a year) and to respond in a timely fashion to email requests. Further details about joining are available on the Working Group (member only link) page. Requirements for patent disclosures, as well as terms and conditions for licensing essential IPR, are given in the W3C Patent Policy.
More information about the W3C is available, as is information about joining W3C.
Specification | FPWD | LC | CR | PR | Rec |
---|---|---|---|---|---|
Multimodal Architecture and Interfaces | Completed | TBD | June 2008 | December 2008 | February 2009 |
EMMA | Completed | Completed | Completed (CR period ends on 14 April, 2008) | TBD | TBD |
InkML | Completed | Completed | 1Q 2008 | TBD | TBD |
This is intended to give you a brief summary of each of the major work items under development by the Multimodal Interaction Working Group. The suite of specifications is known as the W3C Multimodal Interaction Framework.
The following indicates current work items. Additional work is expected on topics described in section 4 of the charter, including multimodal authoring, modality component interfaces, composite multimodal input, and coordinated multimodal output.
A loosely coupled architecture for the Multimodal Interaction Framework that focuses on providing a general means for components to communicate with each other, plus basic infrastructure for application control and platform services. Work is continuing on how the architecture can be realized in terms of well defined component interfaces and eventing models.
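As a purely illustrative sketch of what such an eventing model might look like, the XML below shows an interaction manager asking a voice modality component to start a dialog; the namespace, element and attribute names are hypothetical and are not taken from the architecture drafts.

```xml
<!-- Hypothetical life-cycle event from an interaction manager to a voice
     modality component. All names here are illustrative only; they are not
     the vocabulary defined by the Multimodal Architecture drafts. -->
<mmi xmlns="http://example.org/mmi-arch-sketch">
  <startRequest source="interaction-manager"
                target="voice-component"
                context="ctx-42"
                requestID="req-7">
    <!-- Tell the component which dialog to run -->
    <contentURL href="checkin-dialog.vxml"/>
  </startRequest>
</mmi>
```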
EMMA is being developed as a data exchange format for the interface between input processors and interaction management systems. It defines the means for recognizers to annotate application-specific data with information such as confidence scores, time stamps, input mode (e.g. key strokes, speech or pen), alternative recognition hypotheses, and partial recognition results. EMMA is a target data format for the semantic interpretation specification being developed in the Voice Browser Activity, which describes annotations to speech grammars for extracting application-specific data as a result of speech recognition. EMMA supersedes earlier work on the natural language semantics markup language in the Voice Browser Activity.
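As a rough illustration of these annotations, the sketch below shows an EMMA document carrying two alternative recognition hypotheses for a single spoken utterance, each with a confidence score, timestamps and the recognized tokens; the application payload (origin/destination) is a made-up example, and details may differ from the Candidate Recommendation.

```xml
<emma:emma version="1.0"
           xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- Two alternative hypotheses for one spoken input, captured via voice -->
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice"
               emma:start="1199979553000" emma:end="1199979556000">
    <emma:interpretation id="int1" emma:confidence="0.75"
                         emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68"
                         emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
```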
This work item sets out to define an XML data exchange format for ink entered with an electronic pen or stylus as part of a multimodal system. This will enable the capture and server-side processing of handwriting, gestures, drawings, and specific notations for mathematics, music, chemistry and other fields, as well as supporting further research on this processing. The Ink subgroup maintains a separate public page devoted to W3C's work on pen and stylus input.
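For a flavour of the format, the sketch below shows a minimal InkML document in the style of the working drafts: each trace holds the x y sample points captured between pen-down and pen-up (the coordinate values here are invented).

```xml
<ink xmlns="http://www.w3.org/2003/InkML">
  <!-- Each trace is one pen-down ... pen-up stroke, as "x y" points -->
  <trace>
    10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112
  </trace>
  <trace>
    130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125
  </trace>
</ink>
```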
IST-FP6-001895 "Multimodal Web Interaction" (MWeb) Project: A W3C initiative funded by the European Commission in support of the development and adoption of W3C standards that enable multimodal Web access via mobile devices. MWeb includes European outreach and the development of demonstrators.
Openstream Multimodal Interaction use case demo (Macromedia Flash video).
The W3C Voice Browser working group published a set of requirements for multimodal interaction in July 2000. The working group also invited participants to demonstrate proof of concept examples of multimodal applications. A number of such demonstrations were shown at the working group's face to face meeting held in Paris in May 2000.
To get a feeling for future work, the W3C together with the WAP Forum held a joint workshop on the Multimodal Web in Hong Kong in late 2000. This workshop addressed the convergence of W3C and WAP standards, and the emerging importance of speech recognition and synthesis for the Mobile Web. The workshop's recommendations encouraged W3C to set up a multimodal working group to develop standards for multimodal user interfaces for the Web.
The IETF Speech Services Control (SpeechSC) working group is developing protocols to support distributed speech recognition, speech synthesis and speaker verification services, and expects to take advantage of W3C's work on the speech recognition grammar specification (SRGS), the speech synthesis markup language (SSML), semantic interpretation (SI) and extensible multimodal annotation (EMMA). ETSI's STQ Aurora project is looking at codecs optimized for distributed speech recognition. See also David Pearce's presentation on DSR to the W3C VB/MMI working groups on 25th May 2005.
ETSI standard ES 202 076 defines a generic spoken command vocabulary for controlling common operations such as calling someone by saying their name, browsing through a voice mail box, adjusting the volume, muting the microphone, and setting other device properties. ETSI provides bindings of the vocabulary to a variety of human languages. This suggests the possibility of device-based recognition for common spoken commands together with network-based recognition for other vocabularies.
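One way to realize device-based recognition of such commands is with a small speech grammar in SRGS XML form; the sketch below is a hypothetical command grammar for illustration only and does not reproduce the vocabulary defined in ES 202 076.

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="command">
  <!-- Hypothetical command vocabulary; the actual ETSI wording differs -->
  <rule id="command" scope="public">
    <one-of>
      <item>volume up</item>
      <item>volume down</item>
      <item>mute</item>
      <item>call <ruleref uri="#contact"/></item>
    </one-of>
  </rule>
  <rule id="contact">
    <one-of>
      <item>voice mail</item>
      <item>home</item>
    </one-of>
  </rule>
</grammar>
```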
Another idea is to couple a local graphical user interface with a remote voice dialog engine, perhaps based upon VoiceXML, allowing events to be passed between the device and the remote dialog engine. To the application developer, these events would look just the same whether they originated locally or remotely. In this approach, events can be used to initiate a range of actions, for instance, changing the focus of interaction, setting the value of a form field, loading a new page, or altering the current page via the DOM. W3C work on REX aims to provide an XML grammar for DOM events with a view to supporting distribution of events, and in principle it could be used to couple different modality components.
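Purely as an illustration of the idea, the sketch below shows how such an event might be serialized as XML when the remote voice dialog fills a field in the local page; the namespace and element names are invented for this example and are not the REX syntax.

```xml
<!-- Invented event vocabulary, for illustration only (not REX): the remote
     VoiceXML dialog asks the local XHTML user agent to set a form field -->
<events xmlns="http://example.org/mmi-event-sketch">
  <event name="field-filled" target="id('destination')" source="voice-dialog">
    <value>Denver</value>
  </event>
</events>
```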
SIP can also be used to synchronize several devices, for instance to update the display on a PDA, automotive or desktop system in concert with the much smaller display on a cellphone. When it comes to setting up a session that potentially involves multiple devices and servers, SIP looks like it will provide an effective solution together with server-side scripts. The Voice Browser working group's work on call control may prove valuable.
ETSI EG 202 191 - V1.1.1 - Human Factors (HF); Multimodal interaction, communications and navigation guidelines (PDF). A study of design principles for multimodal applications with a focus on accessibility. Published August 2003.
InkXML specification (W3C Members only) contributed to W3C on 16th August 2002 by IBM, Intel, the International Unipen Foundation, and Motorola, Inc. InkXML is a markup language for the exchange of virtual ink, conveying such information as the kind of pen, the color of the ink and the nature of the medium, the pressure applied to the pen, its position and speed. InkXML can be used to exchange virtual ink among devices, such as handhelds, laptops, desktops, and servers. InkXML is intended to provide the ink component of Web-based multimodal applications. The working group consensus process will determine which ideas in InkXML will be taken up within W3C. W3C Members can view the contribution letter.
Multimodal browser architecture (PDF) by Stéphane Maes (IBM), dated 20th August 2001. Makes the case for using the model-view-controller paradigm and presents a variety of architectures for synchronization across modalities and devices. This is the presentation (T2-010705) that Stéphane gave to the 3GPP T2 meeting in September 2001.
Multimodal access position paper (PDF) by Nathalie Amann, Laurent Hue and Klaus Lukas (Siemens), dated 26th November 2001. Describes a possible architecture for multimodal interaction based upon coupling a visual client with a VoiceXML interpreter.
Towards SMIL as a foundation for multimodal, multimedia applications (PDF), by Jennifer Beckham (University of Wisconsin), Giuseppe Di Fabbrizio, and Nils Klarlund (AT&T Labs), dated 1 October 2001. Shows how SMIL can provide fine-grained synchronization control for multimodal interaction. The approach combines SMIL with markup for control of speech engines.
XHTML+Voice, W3C Submission by IBM, Motorola and Opera Software. Dated 30th November 2001. Shows how markup for XHTML and VoiceXML can be combined to support multimodal interaction. An updated version was contributed to the Voice Browser and Multimodal Interaction working groups on 11th March 2003, see the Team Comment for details of associated IPR disclosures. W3C Members can view the contribution letter.
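The sketch below gives a schematic impression, in the spirit of the Submission, of how a VoiceXML form embedded in an XHTML page can be attached to a visual field via XML Events; it is a simplified illustration rather than a verbatim X+V example, and it omits the synchronization of the recognized value back into the field.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Sketch: voice-enabling a text field</title>
    <!-- A VoiceXML form embedded alongside the visual markup -->
    <vxml:form id="say_city">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="city.grxml" type="application/srgs+xml"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <form action="search">
      <!-- XML Events: focusing the text field activates the voice dialog -->
      <input type="text" name="city" ev:event="focus" ev:handler="#say_city"/>
    </form>
  </body>
</html>
```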
The SALT Forum was launched on 15th October 2001 with a mission to develop standards for speech-enabling HTML, XHTML and SMIL. More recently, it has been applied to speech-enabling SVG. The SALT 1.0 specification was contributed to the Multimodal Interaction and Voice Browser working groups on 31st July 2002, and the working group consensus process will determine which ideas in SALT will be taken up within W3C. W3C Members can view the contribution letter. The SALT+SVG profile was provided as a subsequent contribution.
3GPP is studying different ways to include speech-enabled services, comprising both speech-only and multimodal services, in 3G networks. One option for distributed speech recognition is based on ETSI's STQ Aurora developments. Other options are dependent on the general study on speech-enabled services. 3GPP may be interested in working on integrating remote access to speech synthesis resources. W3C should keep a watching brief. There is a possible connection to the IETF Speech Services Control Working Group (SpeechSC), which is developing protocols for distributed access to speech synthesis, recognition and speaker verification services (MRCP).
For more details on other organizations see the Multimodal Interaction Charter.