[ieee 2009 proceedings of the 5-th conference on speech technology and human-computer dialogue...

Real-Time Architectures for a Network-Based Text-to-Speech Service Implementation

Mihai Surmei

Faculty of Electronics, Telecommunications and Information Technology, University “Politehnica” of Bucharest, Romania


Dragos Burileanu

Faculty of Electronics, Telecommunications and Information Technology, University “Politehnica” of Bucharest, and Romanian Academy Center for Artificial Intelligence, Bucharest, Romania

Cristian Negrescu, Catalin Ungurean, Aurelian Dervis

Faculty of Electronics, Telecommunications and Information Technology, University “Politehnica” of Bucharest, Romania

Abstract - Communication as we know it is about to change: simple peer-to-peer voice calls are being replaced by a mix of different media belonging to the same session. Network-based speech processing technologies are increasingly used, adding flexibility, speed and naturalness to various telecom services. For instance, text-to-speech (TTS) synthesis is already incorporated into commercial telephony applications, but in most cases the services offered are not real-time. This paper proposes a solution for building a network-based media processing environment and for designing a multi-threaded and multi-process TTS engine, integrated with the rest of the network by means of flexible and standardized signaling protocols. As a particular application of the proposed approach, we present a convergent real-time telecom service, combining instant messaging, presence and TTS in a single new experience.

Keywords - network-based speech processing; TTS synthesis; real-time telecom service; IMS network; MRCP; instant messaging

I. INTRODUCTION

Integration of multiple input-output modalities in the interactive services delivered over telephone devices is a steady concern for the developers of Next Generation Networks (NGNs), which represent the next step after traditional public switched telephone networks and more recent digital data networks. One possible solution is to include speech processing technologies such as synthesis and recognition in the broader list of network services already deployed on the market and constantly demanded by end-users. Speech is not only the most natural and intuitive means of communication, but also allows for a more flexible, uniform, and standardized user interface and for better control and personalization [1], [2].

Text-to-speech synthesis is a perfect example of how speech technology can overcome common usability constraints appearing in mobile devices, adding flexibility and naturalness to different telecom services and helping in enhancing the user experience. Currently, there are many real-life implementations of services involving TTS synthesis, offered by several companies or telephony operators, enabled by various kinds of technologies and located at different places on the telecommunication layers, centralized or embedded. Usually, speech output is complementary to the visual interface so that the user can select any of them.

However, it is important to observe that most of these TTS-based services, even if they follow a general client-server architecture, fall into one of the following two categories:

• They are not real-time, as in e-mail or SMS (Short Message Service) reader applications [3].

• They require supplementary pieces of software (for example proprietary interfaces) on the client side, and are usually tailored for a limited set of devices [4], [5], [6].

First, we must remark that reading e-mails or SMS messages is obviously useful, but rather limited, especially because these services are by definition not real-time. Real-time services bring the toughest challenges and the strictest design requirements.

Second, using proprietary interfaces makes integration with the rest of the communication network increasingly difficult and exposes the owner of the network to a distinct risk of being locked in to a particular technology. Furthermore, given the overwhelming evolution towards an “all IP” ecosystem, any solution based on legacy technologies such as TDM (Time Division Multiplex) to convey the resulting voice could be considered already obsolete.

In order to avoid the above-mentioned limitations, one could decide to embed the speech technology into the end-user terminals [7], [8]. Besides the increased processing power and storage it requires, another drawback of this approach is the over-specialization of the devices, which turns them into niche terminals. The best option here is to leverage the most widespread type of terminals available and not to put tough constraints on them.

The research reported in this paper was funded by the Romanian Government, under the National Research Authority CNCSIS grant “IDEI” no. 782/2007.

Perhaps the most important issue, however, is the lack of a unified way of designing the service itself. We bring innovation by considering text-to-speech conversion (together with a voice recognition process) as a fundamental component in the telecom service space. This approach gives structure and leads to a canonical way of building innovative telecom services.

Our new approach is based on widespread network protocols and on multi-thread and multi-process practice, suited to carrier-grade use for implementing real-time services. To this end, we used a widely adopted C++ framework. Furthermore, the underlying signaling support, mainly the MRCP (Media Resource Control Protocol) stack, has been added and closely coupled in order to build a Media Resource Server, opening the way for real-time TTS services.

To sustain this point of view we designed an end-to-end prototype telecom service leveraging our own TTS technology. The service mixes together an IM (Instant Messaging) session, presence information and TTS conversion, as a real-time communication channel. To our knowledge, this is the first such initiative for the Romanian language.

The e-mail reader built by our team and presented in [9] served well as a proof of concept, but it was inherently flawed in several ways and is not suitable as a basis for further developments:

• It was a single-threaded application.

• The architecture depended on dedicated hardware for adapting to the PSTN (Public Switched Telephone Network) line.

• An e-mail reader is limited in terms of service complexity, being an off-line type of communication.

The current TTS system implementation (desktop version) has been enhanced and partially re-coded in order to be integrated in a multi-thread/multi-process environment, bearing in mind that an important characteristic of TTS-based services is the ability to sustain real-time communication.

II. HIGH LEVEL ARCHITECTURES

There are several ways of grouping the main existing telecom networks, one of the most common being related to mobility. Based on that we have:

A. Mobile networks

This category includes the widespread GSM (Global System for Mobile Communications)/WCDMA (Wideband Code Division Multiple Access) networks, the older CDMA networks (with different flavors in the US and Japan), and the recent WiMAX (Worldwide Interoperability for Microwave Access) approach.

These networks have the roaming functionality in common, although not necessarily between technologies. In terms of number of subscribers and overall traffic, this kind of network is by far the dominant one.

B. Fixed networks

This type of network is closely linked with our telecom legacy: the POTS (Plain Old Telephone Service) oriented incumbent operators are the ones who decide the strategies and the way forward. Lately, the access types have diversified to include wireless access as well.

In this context there is no concept of mobility, the subscriber being linked with a specific geographic location.

C. Nomadic networks

One can place the common WLAN (Wireless Local Area Network) under this umbrella. The main characteristic is the ability to re-associate with another domain access point, but handover support is still in its infancy. The IEEE 802.11r standard addresses the handover issues for VOIP (Voice Over IP) type of traffic.

One of the main drawbacks of mobile and fixed networks was their inability to support services other than voice and SMS. This was the reason for developing intelligent networks, which behave as a signaling overlay network in order to give better control over the way calls are made and finally charged.

Nevertheless, this did not solve the problem. The new type of network was still addressing voice and SMS and nothing else. In recent years, a major change in communication behavior has occurred: the end-user switched from peer-to-peer single-media communication such as a voice call, or fire-and-forget messaging such as SMS, to a session-oriented type of communication containing:

• Several kinds of media: voice, text, picture, movie, file share (upload/download).

• Extra information: presence, location, etc.,

all of these mixed together in a single session, possibly toward multiple recipients.

The reason for this change is the convergence of telecom and Internet, and this convergence calls for a dedicated solution: the IMS (IP Multimedia Subsystem) network.

IMS is already standardized by the 3GPP (3rd Generation Partnership Project) organization and several networks have already been deployed. This network is access-agnostic, as it is based on the Internet philosophy of abstracting the functionalities into several layers. This means that a service built on top of IMS will work equally on a GSM network or a WLAN hotspot.

The IMS network makes extensive use of SIP (Session Initiation Protocol) [10] in order to achieve the mixing of the media. An IMS call could be anything between a plain VOIP phone call and a rich session with voice, video, instant messaging, file sharing, presence and location information.

As a consequence of IMS capabilities, the messaging services are much more evolved than before. It is expected that, in the near future, SMS usage will flatten and reach a zero growth rate, catching up with voice services, which are already there. From now on, growth will be sustained by the increasing usage of instant messaging and e-mail from the mobile phone, besides other means. In any case, it is rather difficult or even impossible to foresee the next “killer application” after SMS.

Coming from a totally different direction is the effort of Internet companies to cover the new communication needs. The philosophy here is to emphasize the capabilities of the terminals, boost their processing capabilities and transform the underlying communication network into a simple bit pipe. But the multimedia needs remain virtually the same for the end-user.

III. THE REAL-TIME NETWORK ARCHITECTURE PROPOSAL

A. The telecom service

The possible services making use of TTS/ASR (Automatic Speech Recognition) conversions are quite diverse, and one good way forward is to implement a reference architecture for speech synthesis in the Romanian language, having the following characteristics:

• Leveraging on open and standardized protocols.

• Permanently following and updating to the latest developments in the telecom field, in order to ensure the proper inter-working with other functional elements from the ecosystem.

• It should be easy to implement a sub-architecture by eliminating some functionalities from the scope.

• Being general enough to be (later) extended by closing the loop via speech recognition.

To achieve this goal our team is developing a telecom service to be executed on this reference architecture, complex enough to be considered as a starting point for any commercial implementation. Fig. 1 depicts this kind of service.

The service flow will be the following:

• The terminals will be common GSM phones, capable of sustaining a GPRS (General Packet Radio Service)/EDGE (Enhanced Data rates for GSM Evolution)/3G (Third Generation)/HSDPA (High-Speed Downlink Packet Access) connection.

Figure 1. Generic real-time TTS-based telecom service

• The terminals will send text and receive the voice channel with the TTS conversion result.

• The underlying network will consist of the existing GSM/WCDMA infrastructure with an independent SIP signaling layer on top of it. This so-called overlay network will provide the TTS functionality. If the operator already has a SIP infrastructure, this could be reused as well.

Analyzing the proposed service one can observe the fact that such a communication model: text – voice – text, practically involves two different media types mixed together in the same session, making IMS the natural choice as a supporting network:

• Text: SMS, e-mail or instant messaging.

• Voice: phone call, voice mail.

B. The IMS/SIP overlay network design – reference architecture

Implementing a full-fledged IMS network is beyond the scope of our research; our effort in that area is to integrate the existing IMS functionalities in order to have a proper test network to develop on. One possible theoretical network, a classical VOIP implementation based on SIP, is depicted in Fig. 2.

The minimal (standard VOIP) architecture from Fig. 2 contains a SIP server that will fulfill both the registrar and the application server (AS) roles in a generic SIP environment (SIP AS/Registrar). The underlying GPRS/3G network is represented by the GGSN (Gateway GPRS Support Node). The IP infrastructure needed for the SIP signaling (the straight arrows in the figure) connects network nodes (GGSN, SIP AS/Registrar, MRCP client and MRCP server) and the terminals involved in the conversation.

But this simple architecture does not allow mixing all the service capabilities we need, therefore we decided to use a different approach: we developed an IMS/SIP overlay network able to support a real-time TTS-based service. This reference architecture is depicted in Fig. 3.

Figure 2. TTS enhanced VOIP architecture

Figure 3. TTS service in IMS context

The architecture presented in Fig. 3 illustrates the main functional roles of an IMS network: the SBC (Session Border Controller), which sustains mobile terminal connectivity through the operator’s firewall layers; the HSS (Home Subscriber Server), used for subscriber authentication and authorization; the CSCF (Call Session Control Function), which routes the calls; the AS (Application Server), which executes the service itself together with other telephony services; and the MRFC (Media Resource Function Controller) and MRFP (Media Resource Function Processor), which implement, for instance, the traditional IVR (Interactive Voice Response) or voicemail functionalities, similar to the classical GSM or PSTN networks.

The TTS functionality is located close to the MRFP/MRFC pair, being by definition a hybrid functional node that combines signaling and payload. There are several standardization efforts to cover speech technology (both synthesis and recognition) in the telecom context. The most advanced one is the MRCPv2 protocol, referenced by 3GPP documents (TR 24.880).

The MRCP protocol allows client hosts to control media service resources such as speech synthesizers, recognizers, verifiers and identifiers residing in servers on the network [1], [11]. MRCP is not a “stand-alone” protocol; it relies on a session management protocol such as SIP to establish the MRCP control session between the client and the server, and for rendezvous and capability discovery. It also depends on SIP and SDP (Session Description Protocol) [12] to establish the media sessions and associated parameters between the media source or sink and the media server. Once this is done, the MRCP protocol exchange operates over the control session established above, allowing the client to control the media processing resources on the speech resource server.
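To illustrate the SIP/SDP step described above, the following minimal sketch assembles the kind of SDP offer a client could place in its SIP INVITE to request one synthesizer control channel and one audio stream. The attribute layout follows the examples given in the MRCPv2 specification; the function name, the address and the port are hypothetical, the client is assumed (for simplicity) to be the audio sink itself, and the snippet is not the code of the prototype.

```python
def build_mrcp_sdp_offer(client_ip: str, rtp_port: int, session_id: int = 1) -> str:
    """Return an SDP offer for a speechsynth control channel plus one audio stream.

    All argument values are illustrative assumptions; the attribute layout
    mirrors the examples of the MRCPv2 specification.
    """
    lines = [
        "v=0",
        f"o=- {session_id} {session_id} IN IP4 {client_ip}",
        "s=-",
        f"c=IN IP4 {client_ip}",
        "t=0 0",
        # Control channel: port 9 is a placeholder, the client opens the TCP connection.
        "m=application 9 TCP/MRCPv2 1",
        "a=setup:active",
        "a=connection:new",
        "a=resource:speechsynth",
        "a=cmid:1",
        # Media stream: the offerer only receives the synthesized audio.
        f"m=audio {rtp_port} RTP/AVP 0",
        "a=rtpmap:0 PCMU/8000",
        "a=recvonly",
        "a=mid:1",
    ]
    return "\r\n".join(lines) + "\r\n"
```

The `a=cmid:1` attribute ties the control channel to the audio stream labeled `a=mid:1`, which is how the server knows where the synthesized speech has to be sent.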

According to 3GPP recommendations, the MRCP client is to be implemented on the MRFP node, and the MRCP server built separately on a different hardware platform (see Fig. 3). The TTS engine is considered to be an application running on the MRCP server, closely linked with the signaling (SIP) and payload (RTP, Real-time Transport Protocol [13]) connectors.

This architecture, centered on the TTS functionality and using existing reference implementations for CSCF, HSS, SBC, MRFP/MRFC, and AS, allows building convergent services with advanced TTS capabilities, the first such realization for the Romanian language.

MRCP is increasingly popular and closely connected to the development of IMS networks. This is the reason we are implementing this protocol as a basis for integration with the rest of the network.

IV. A TTS-BASED COMMUNICATION SERVICE IMPLEMENTATION

The way of communicating is shifting towards a multi-session convergent experience and IMS networks expose the needed services to make this change possible. Our team developed a specific multimedia telecom service by mixing voice, presence and text-to-speech synthesis on the previously proposed reference architecture.

Let’s first consider the following multimedia flow:

• Two 3G users, A and B, are engaged in an instant messaging session over the phone (Fig. 4). For instance, at the moment of initiating the communication, both of them were walking downtown; this status, “Walking”, was registered beforehand in the IMS network (meaning that the AS from Fig. 3 keeps this status and correlates it with the user).

• User A reached his car and changed (by selecting an option in a menu on the phone) the status from “Walking” to “Driving”, therefore making the act of typing the text on the phone, let’s say, unsafe (Fig. 5). This new state will be registered in AS.

• Automatically, the AS will trigger the TTS conversions, helping user A to continue the messaging session using only voice.
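The presence-driven decision taken by the AS in this flow can be sketched as follows. This is a minimal illustration only: the class name, the method names and the set of "voice-only" states are assumptions made for the example, and a real AS would act on SIP/MSRP traffic rather than on plain strings.

```python
from dataclasses import dataclass, field

# Presence states in which typing or reading is assumed unsafe (illustrative set).
VOICE_ONLY_STATES = {"Driving"}


@dataclass
class ApplicationServer:
    """Toy stand-in for the AS: stores presence and picks a delivery path."""
    presence: dict = field(default_factory=dict)

    def set_presence(self, user: str, status: str) -> None:
        # Called when a user changes status from the phone menu.
        self.presence[user] = status

    def route(self, recipient: str, text: str) -> tuple:
        """Return (delivery path, payload) for a message sent to `recipient`."""
        if self.presence.get(recipient) in VOICE_ONLY_STATES:
            return ("tts", text)  # hand the text over to the MRCP client
        return ("im", text)       # deliver as an ordinary instant message
```

With user A registered as “Walking”, `route("A", ...)` selects the `"im"` path; once A switches to “Driving”, the same call selects the `"tts"` path, which corresponds to the rerouting towards the MRCP client.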

The described service is to be consumed using mobile terminals (or, to a lesser extent, via PC clients). Designing an all-purpose IMS mobile terminal is beyond the scope of our research and is definitely a bleeding-edge technology at the moment. Nevertheless, several clients covering VOIP, IM and presence already exist, running on Symbian or Windows Mobile smartphones. Besides that, there are several free development platforms that allow building custom IMS clients for the mobile operating systems mentioned before.

This service uses three distinct telecom services during the same session: instant messaging, voice and TTS conversion.

Considering Fig. 4 and Fig. 5, this service works as follows:

1) First we suppose both users are already registered in the HSS. The registration is a mandatory mechanism needed to support the routing of the calls between users.

Figure 4. IM session with both users “Walking”

Figure 5. User A is “Driving”

2) The presence status of both users is set, for instance, to “Walking” and stored as such in the application server, which plays the role of a presence server.

3) The scenario from Fig. 4 is a typical IM session over IMS: users A and B alternately send text messages using the IMS clients installed on the mobile phones. Because we are developing a conversational service, the SIP MESSAGE method (RFC 3428) is not enough; the alternative is a session-oriented messaging protocol set up through SIP: the Message Session Relay Protocol (MSRP, RFC 4975).

4) In Fig. 5, user A has reached his car and changed the status to “Driving”.

5) When user B sends a text message, the AS will reroute the text to the MRCP client, based on the new presence state of user A.

6) The MRCP client queries the TTS engine, hosted on the MRCP server, using the MRCP protocol, in order to:

a) convert the text message;

b) open an RTP flow towards the called subscriber. One can observe that the RTP flow does not belong to the overlay network, as it is not signaling; it represents the content of the conversation.

7) Additionally, the flow could be continued and the loop closed using a speech recognition engine, implemented on the same platform as the TTS engine and being part of the overall MRCP proposal. If user A wants to answer back, he will do so by speaking to the terminal, thereby opening another RTP flow (together with another SIP session, not shown in the figures for the sake of simplicity) up to the MRCP server. The ASR engine will route the resulting text message to user B, via the AS.
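As an illustration of step 6, the sketch below builds an MRCPv2 SPEAK request carrying the text to be converted. It is a simplified example under stated assumptions: the function name is hypothetical, the channel identifier would in practice be the one returned by the server during the SIP/SDP exchange, and plain text is used as the body instead of a richer markup. One detail worth showing is that the start line carries the length of the whole message, including the digits of the length itself, so the value is computed iteratively.

```python
def build_speak_request(channel_id: str, request_id: int, text: str) -> bytes:
    """Return the bytes of an MRCPv2 SPEAK request with a plain-text body."""
    body = text.encode("utf-8")
    headers = (
        f"Channel-Identifier: {channel_id}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")

    # The message length in the start line counts the start line itself,
    # so repeat until the value no longer changes.
    length = 0
    while True:
        start_line = f"MRCP/2.0 {length} SPEAK {request_id}\r\n".encode("ascii")
        total = len(start_line) + len(headers) + len(body)
        if total == length:
            return start_line + headers + body
        length = total
```

For example, `build_speak_request("32AECB23433802@speechsynth", 1, "Where are you?")` (with a made-up channel identifier) yields a message whose declared length equals its actual size in bytes.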

The implementation of the overlay IMS test network was realized by integrating existing free functional network nodes. This test network is mandatory for performing the needed development iterations of the TTS engine itself.

Integrating the TTS engine and the needed MRCP signaling poses several challenges. The place of the TTS in the IMS ecosystem is close to the MRFC/MRFP pair because we do not only deal with control activities (IMS is in fact a control layer) but also with the RTP stream itself. The TTS platform therefore brings together two types of technologies with different requirements, the MRCP (SIP-based) signaling and the RTP stack, which makes it a monolithic type of architecture. The RTP stack implementation and the TTS engine have strict timing and speed requirements, with tight QoS (Quality of Service) specifications.
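To make the timing constraint concrete, the following sketch splits synthesized audio into RTP packets and computes when each one is due. It assumes 8 kHz G.711 mu-law audio (one byte per sample) and a 20 ms packetization interval, which are common telephony settings but are not stated in this paper; the function name is hypothetical. The 12-byte header layout is the fixed RTP header defined in [13].

```python
import struct

SAMPLE_RATE_HZ = 8000   # narrowband telephony audio (assumed)
FRAME_MS = 20           # packetization interval (assumed)
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples
PAYLOAD_TYPE_PCMU = 0   # static RTP payload type for G.711 mu-law


def rtp_packets(payload: bytes, ssrc: int, first_seq: int = 0, first_ts: int = 0):
    """Yield (send_offset_seconds, packet) pairs for 8-bit mu-law audio."""
    for i in range(0, len(payload), SAMPLES_PER_FRAME):
        n = i // SAMPLES_PER_FRAME
        header = struct.pack(
            "!BBHII",
            0x80,                          # version 2, no padding, extension or CSRC
            PAYLOAD_TYPE_PCMU,             # marker bit cleared
            (first_seq + n) & 0xFFFF,      # sequence number wraps at 16 bits
            (first_ts + i) & 0xFFFFFFFF,   # timestamp counts samples
            ssrc & 0xFFFFFFFF,
        )
        yield n * FRAME_MS / 1000.0, header + payload[i:i + SAMPLES_PER_FRAME]
```

Under these assumptions the engine has to deliver a new 160-byte frame every 20 ms for each active call, which is why the synthesis and the RTP sender cannot share a single blocking thread.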

This is the reason we chose ACE (Adaptive Communication Environment) to develop this part. ACE is a collection of C++ pattern implementations heavily used in the telecommunication industry [14]. This approach permits highly robust and fast implementations, which are mandatory for the RTP part of the architecture.

The core part of the TTS engine itself was modified in order to integrate it with the ACE framework. The basic requirement for using such frameworks is to ensure that the code is thread-safe. Therefore, in the current (desktop) TTS engine version built by our team (and described in [15] and [16]), we replaced all the standard C library function calls that we knew were not reentrant, and we eliminated the use of static data storage for function return values.
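The difference between the two styles can be illustrated with a small sketch. It is written in Python for brevity, whereas the engine itself is C++ code calling C library functions; the function names are hypothetical and the example does not reproduce the engine's code. The first function returns shared storage, in the manner of a C function returning a pointer to a static buffer; the second returns storage owned by each caller and is therefore safe to call from several threads.

```python
import threading

# Non-reentrant style (the kind of code that was removed): the result lives
# in shared storage, so a later call overwrites what an earlier caller holds.
_shared_result = []


def split_words_static(text: str) -> list:
    _shared_result.clear()
    _shared_result.extend(text.split())
    return _shared_result  # every caller receives the same object


# Reentrant style: each call builds and returns its own storage.
def split_words_reentrant(text: str) -> list:
    return text.split()


def tokenize_concurrently(sentences):
    """Tokenize each sentence in its own thread; results stay independent."""
    results = [None] * len(sentences)

    def worker(index, sentence):
        results[index] = split_words_reentrant(sentence)

    threads = [threading.Thread(target=worker, args=(i, s))
               for i, s in enumerate(sentences)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `split_words_static` twice leaves both callers holding the second result, which is exactly the failure mode that becomes a data race once several synthesis requests run in parallel.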

For the signaling part, Java EE (Java Enterprise Edition) was a clear choice, for maximum productivity and reuse capacity. The open source SailFin project offers a reliable SIP servlet technology, suitable for implementing the needed signaling for the MRCP server. Because it relies on the GlassFish application server, it inherits that server's robustness and availability features.

V. CONCLUSIONS

The way of communicating is changing and we are migrating from classical voice calls and short message exchange to rich multimedia sessions, beyond the usual peer-to-peer paradigm.

Because text-to-speech synthesis is a special type of processing offering a unique man-machine interface, we emphasize its importance by adding it to the larger set of existing telecom services such as voice, messaging, location or presence.

We presented a convergent service example, combining TTS, instant messaging and presence in a single session. This service is by definition real-time, in contrast with the majority of existing TTS-based services such as e-mail or SMS readers.

Mixing various media in a single communication session is realized by means of an IMS test network, functioning as a transparent network over the existing communication infrastructure. Our implementation relies on the MRCP protocol for integrating the TTS engine within the IMS ecosystem.

Designing an MRCP server able to offer TTS capabilities at a carrier-grade level is a difficult task. We have modified our desktop TTS version to add support for multi-threaded and multi-process operation. The MRCP server and client were developed as well.

Many technologies discussed in the paper will certainly evolve over time and sooner or later will be progressively less used. Therefore, to avoid technology lock-in, we focus our development on frameworks (such as ACE or Java EE) and on the reuse of telecom-oriented patterns such as connection management, service initialization, concurrency, error management, and event loop management.

Our proposed MRCP server is a monolithic realization, mixing together the payload and the signaling. In the future we will investigate the opportunity of splitting the monolithic architecture into a layered one, in order to decouple the signaling from the payload.

REFERENCES

[1] D. Burke, Speech Processing for IP Networks: Media Resource Control Protocol (MRCP). New Jersey: John Wiley & Sons Ltd., 2007.

[2] S. Rondel, P. T. Pattabhiraman, A. Ganapathiraju, P. Khademi, and J. Rondel, “Strategic importance of speech technology for NGNs,” White paper. Redmond, WA: Conversational Computing Inc., 2007.

[3] G. Németh, G. Kiss, C. Zainkó, G. Olaszy, and B. Tóth, “Speech generation in mobile phones,” in Human Factors and Voice Interactive Systems, D. Gardner-Bonneau and H. Blanchard, Eds., 2nd Edition. New York: Springer, 2008, pp. 163-191.

[4] M. Bagein, O. Pietquin, C. Ris, and G. Wilfart, “Enabling speech based access to information management systems over wireless network,” Proc. of the 3rd workshop on Applications and Services in Wireless Networks – ASWN’03, Berne, 2003.

[5] T. Shimizu, Y. Ashikari, E. Sumita, H. Kashioka, and S. Nakamura, “Development of client-server speech translation system on a multi-lingual speech communication platform,” Proc. of the Int. Workshop on Spoken Language Translation, Kyoto, pp. 213-216, 2006.

[6] T. Svendsen, A. Egeberg, T. Holter, and T. Skogstad, “VOCALS – voice centric user interfaces for location based services,” Proc. of the 6th Nordic Signal Processing Symposium – NORSIG’05, Stavanger, Sept. 2005.

[7] D. Burileanu, “Spoken language interfaces for embedded applications,” in Human Factors and Voice Interactive Systems, D. Gardner-Bonneau and H. Blanchard, Eds., 2nd Edition. New York: Springer, 2008, pp. 135-161.

[8] R. Talafová, G. Rozinaj, J. Čepko, and J. Vrabec, “Multimedia SMS reading in mobile phone,” in International Journal of Mathematics and Computers in Simulation, vol. 1, Issue 1, 2007, pp. 12-17.

[9] M. Surmei, D. Burileanu, C. Negrescu, R. Pirvu, C. Ungurean, and A. Dervis, “Text-to-speech engines as telecom service enablers,” in Advances in Spoken Language Technology. Bucharest: Publishing House of the Romanian Academy, 2007, pp. 89-98.

[10] J. Rosenberg et al., SIP: Session Initiation Protocol. RFC 3261, June 2002.

[11] S. Shanmugham and D. Burnett, Media Resource Control Protocol ver. 2 (MRCPv2). Draft-ietf-speechsc-mrcpv2-12, March 2007.

[12] M. Handley and V. Jacobson, SDP: Session Description Protocol. RFC 2327, April 1998.

[13] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, RTP: A Transport Protocol for Real-Time Applications. STD 64, RFC 3550, July 2003.

[14] D. C. Schmidt, “The ADAPTIVE communication environment: object-oriented network programming components for developing client/server applications,” Proc. of the 12th Sun Users Group Conf., San Francisco, pp. 214-225, 1994.

[15] D. Burileanu, “Recent advances in text-to-speech synthesis in Romanian language,” in Proceedings of the Romanian Academy, Series A (Information Sciences). Bucharest: Publishing House of the Romanian Academy, in press.

[16] C. Ungurean, D. Burileanu, V. Popescu, C. Negrescu, and A. Dervis, “Automatic diacritic restoration for a TTS-based e-mail reader application,” in UPB Scientific Bulletin, Series C, vol. 70, Issue 4. Bucharest: Politehnica Press, 2008, pp. 3-12.
