
    Design of a GSM Vocoder using SpecC Methodology

    Technical Report ICS-99-11, February 26, 1999

    Andreas Gerstlauer, Shuqing Zhao, Daniel D. Gajski

    Department of Information and Computer Science, University of California, Irvine, Irvine, CA 92697-3425, USA

    (949) 824-8059

    {gerstl, szhao, [email protected]

    Arkady M. Horak

    Motorola Semiconductor Products Sector, System on a Chip Design Technology

    Austin, TX 78731, USA

    [email protected]


    Abstract

    This report describes the design of a voice encoder/decoder (vocoder) based on the European GSM standard, employing the system-level design methodology developed at UC Irvine. The project is a result of a cooperation between UCI and Motorola to demonstrate the SpecC methodology. Starting from the abstract executable specification written in SpecC, different design alternatives concerning the system architecture (components and communication) are explored, and the vocoder specification is gradually refined and mapped to a final HW/SW implementation such that the constraints are satisfied optimally. The final code for downloading onto the processors and the RTL hardware descriptions for synthesis of the ASICs are generated for the software and hardware parts, respectively.

    Contents

    1 Introduction ... 1
      1.1 GSM Enhanced Full Rate Vocoder ... 1
        1.1.1 Human Vocal Tract ... 2
        1.1.2 Speech Synthesis Model ... 2
        1.1.3 Speech Encoding and Decoding ... 2
      1.2 System-Level Design ... 3
        1.2.1 SpecC Methodology ... 3
        1.2.2 SpecC Language ... 4
      1.3 Overview ... 4

    2 Specification ... 4
      2.1 General ... 4
        2.1.1 Formal, Executable Specification ... 4
        2.1.2 Modeling Guidelines ... 5
      2.2 Vocoder Specification ... 5
        2.2.1 Overview ... 6
        2.2.2 Coder Functionality ... 6
        2.2.3 Decoder Functionality ... 7
        2.2.4 Constraints ... 8

    3 Architectural Exploration ... 9
      3.1 Models ... 9
        3.1.1 Specification Model ... 9
        3.1.2 Architecture Model ... 10
      3.2 Exploration Flow ... 11
      3.3 Analysis and Estimation ... 11
        3.3.1 General Discussion ... 11
        3.3.2 Initial Simulation and Profiling ... 12
        3.3.3 Estimation ... 12
        3.3.4 Vocoder Analysis and Estimation ... 13
      3.4 Architecture Allocation ... 15
        3.4.1 General Discussion ... 15
        3.4.2 Allocation Flow ... 16
        3.4.3 Vocoder Architecture ... 17
      3.5 Partitioning ... 18
        3.5.1 General Discussion ... 18
        3.5.2 Partitioning Flow ... 19
        3.5.3 Partitioning for Vocoder ... 20
      3.6 Scheduling ... 22
        3.6.1 General ... 22
        3.6.2 Vocoder Scheduling ... 23
      3.7 Results ... 23

    4 Communication Synthesis ... 24
      4.1 Protocol Selection ... 24
      4.2 Transducer Synthesis ... 25
      4.3 Protocol Inlining ... 25
      4.4 Vocoder Communication Synthesis ... 26
        4.4.1 Protocol Selection ... 26
        4.4.2 Protocol Inlining ... 27
      4.5 Results ... 28

    5 Backend ... 29
      5.1 Software Synthesis ... 29
        5.1.1 Code Generation ... 29
        5.1.2 Compilation ... 30
        5.1.3 Simulation ... 30
      5.2 ASIC Exploration ... 32
        5.2.1 Behavioral Model ... 33
        5.2.2 Architecture Exploration ... 35
        5.2.3 Performance Analysis ... 40

    6 Conclusions ... 42

    References ... 44

    A C Reference Implementation Block Diagrams ... 46
      A.1 Coder: coder ... 47
        A.1.1 Encoding: coder_12k2 ... 48
      A.2 Decoder: decoder ... 53
        A.2.1 Decoding: decoder_12k2 ... 54
        A.2.2 Post-processing: Post_Filter ... 55

    B Vocoder Specification ... 56
      B.1 General (shared) behaviors ... 56
      B.2 Coder ... 57
        B.2.1 Preprocessing: pre_process ... 58
        B.2.2 Linear prediction analysis and quantization ... 58
        B.2.3 Open-loop pitch analysis ... 59
        B.2.4 Closed-loop pitch analysis ... 60
        B.2.5 Algebraic (fixed) codebook analysis ... 62
        B.2.6 Filter memory updates ... 65
        B.2.7 Serialization: Prm2bits_12k2 ... 65
      B.3 Decoder ... 67
        B.3.1 Parameter extraction: Bits2prm_12k2 ... 67
        B.3.2 Decoding of LP filter parameters ... 67
        B.3.3 Decoding subframe and synthesizing speech ... 68
        B.3.4 Post-filtering: Post_Filter ... 69
        B.3.5 Up-scaling ... 70

    C Simulation Results ... 72
      C.1 Software ... 72

    D ASIC Datapath Schematic ... 77

    E SpecC Source Listing for the codebook search ... 78

    F Behavioral VHDL Source Listing for the codebook search ... 95

    G RTL VHDL Source Listing for the codebook search ... 123

    List of Figures

    1 Speech synthesis model. ... 2
    2 SpecC methodology. ... 3
    3 Vocoder. ... 5
    4 Coder. ... 6
    5 Encoding. ... 7
    6 Decoder. ... 7
    7 Timing constraints. ... 8
    8 General specification model. ... 9
    9 General model for architectural exploration. ... 10
    10 Architectural exploration flow. ... 11
    11 Sample operation profile. ... 12
    12 Estimates for computational complexity of coder parts. ... 14
    13 Breakdown of initial coder delays. ... 14
    14 Initial coder delay. ... 14
    15 Estimates for computational complexity of decoder parts. ... 14
    16 Breakdown of initial decoder delays. ... 14
    17 Initial decoder delay. ... 14
    18 Examples of mixed HW/SW architectures. ... 16
    19 Allocation search tree. ... 16
    20 Component matching. ... 17
    21 Computational requirements. ... 17
    22 Example of an encoder partitioning. ... 19
    23 Criticality of vocoder behaviors. ... 20
    24 Balancing resource utilization. ... 20
    25 Final vocoder partitioning. ... 21
    26 Channel partitioning. ... 22
    27 Sample encoder partition after scheduling. ... 22
    28 Final dynamic scheduling of vocoder tasks. ... 23
    29 Breakdown of coder delays after exploration. ... 24
    30 Breakdown of decoder delays after exploration. ... 24
    31 Architecture model. ... 24
    32 General model after protocol selection. ... 24
    33 Sample model after transducer synthesis. ... 25
    34 General communication model after inlining. ... 25
    35 Vocoder model with processor bus protocol selected. ... 26
    36 Vocoder communication model after inlining. ... 26
    37 Vocoder hardware/software interfacing model. ... 27
    38 Original C source code example. ... 31
    39 Assembly output of Motorola compiler. ... 31
    40 Assembly code after optimizations. ... 31
    41 HLS design flow. ... 32
    42 State-oriented models. ... 33
    43 The sample encoder partition. ... 34
    44 Scheduled encoder ASIC partition (Note: DataIn and DataOut FSMDs for behaviors other than the 2nd Levinson-Durbin are omitted.) ... 34
    45 The scheduled codebook search CHSFSMD model. ... 35
    46 Data-flow view of codebook search behavioral model. ... 36
    47 A generic control unit/datapath implementation. ... 37
    48 Hardware exploration. ... 38
    49 Operation profile for one sub-frame. ... 38
    50 Behavior prefilter FSMD. ... 39
    51 Datapath diagram. ... 39
    52 An FSMD implementation with a decomposed CU. ... 40
    53 Control unit decomposition. ... 40
    54 Sub-FSM in VHDL. ... 41
    55 Execution time distribution. ... 42
    56 Critical path candidates. ... 43
    57 Vocoder project tasks schedule. ... 45
    58 Coder. ... 57
    59 LP Analysis. ... 58
    60 Open-loop pitch analysis. ... 59
    61 Closed-loop pitch search. ... 60
    62 Algebraic (fixed) codebook search. ... 62
    63 Filter memory update. ... 65
    64 Coder block diagram. ... 66
    65 Decoder. ... 67
    66 LSP decoding. ... 67
    67 Subframe decoding. ... 68
    68 Post filtering. ... 70
    69 Decoder block diagram. ... 71

    Design of a GSM Vocoder using SpecC Methodology

    A. Gerstlauer, S. Zhao, D. Gajski

    Information and Computer Science

    University of California, Irvine

    Irvine, CA 92697-3425, USA

    A. Horak

    Motorola Semiconductor Products Sector

    System on a Chip Design Technology

    Austin, TX 78731, USA

    Abstract

    This report describes the design of a voice encoder/decoder (vocoder) based on the European GSM standard, employing the system-level design methodology developed at UC Irvine. The project is a result of a cooperation between UCI and Motorola to demonstrate the SpecC methodology. Starting from the abstract executable specification written in SpecC, different design alternatives concerning the system architecture (components and communication) are explored, and the vocoder specification is gradually refined and mapped to a final HW/SW implementation such that the constraints are satisfied optimally. The final code for downloading onto the processors and the RTL hardware descriptions for synthesis of the ASICs are generated for the software and hardware parts, respectively.

    1 Introduction

    For the near future, the recent predictions and roadmaps on silicon semiconductor technology all agree that the number of transistors on a chip will keep growing exponentially according to Moore's Law, pushing technology towards the System-On-Chip (SOC) era. However, we are increasingly experiencing a productivity gap where the chip complexity that can be handled by current design teams falls short of the possibilities offered by technology advances. Together with growing time-to-market pressures, this drives the need for innovative measures to increase design productivity by orders of magnitude.

    It is commonly agreed that the solutions for achieving such a leap in design productivity lie in shifting the focus of the design process to higher levels of abstraction on the one hand, and in the massive reuse of predesigned, complex system components (intellectual property, IP) on the other. To be successful, both concepts eventually require the adoption of new methodologies for system design in the companies, backed up by the availability of a corresponding set of system-level design automation tools.

    At UC Irvine, evolving from a background in behavioral or high-level synthesis, research has been conducted in this area for a number of years. An IP-centric system-level design methodology, including a new system specification language called SpecC, was developed. Currently, work in progress at UCI deals with the creation of algorithms and tools for automating the steps in this design process.

    UCI, in cooperation with Motorola, launched a project in June 1998 with the goal of demonstrating the SpecC methodology on a real design example. For this purpose, the design and implementation of a voice encoder/decoder (vocoder) for cellular applications was chosen. The actual design of the vocoder was done at UCI, whereas Motorola is responsible for the final integration, implementation, and eventually manufacturing of the chip. In this report we document the design of the vocoder chip employing the SpecC methodology, covering the part of the project done at UC Irvine.

    1.1 GSM Enhanced Full Rate Vocoder

    The vocoder used in this project is based on the standard for speech coding and compression in the European cellular telephone network system GSM (Global System for Mobile Communications). The codec scheme was originally developed by Nokia and the University of Sherbrooke [10]. The so-called Enhanced Full Rate (EFR) speech transcoding is now standardized by the European Telecommunication Standards Institute (ETSI) as GSM 06.60 [9]. In addition, the same codec has also been adopted as a standard for the American PCS 1900 system by the Telecommunications Industry Association (TIA) [11]. In general, this codec scheme and variations thereof are widely used


    [Figure 1: Speech synthesis model. Residual pulses from the fixed codebook and the output of the long-term pitch filter (delay / adaptive codebook) are summed to form the excitation for the 10th-order short-term LP synthesis filter, which produces the speech.]

    in voice compression and encoding for speech transmission (e.g. [12]).

    1.1.1 Human Vocal Tract

    Conceptually, the main idea of a speech synthesis vocoder is to model the human vocal tract using digital signal processing (DSP) techniques in order to synthesize or recreate speech at the receiving side.

    Human speech is produced when air from the lungs is forced through an opening between the two vocal folds called the glottis. Tension in the vocal cords caused by muscle contractions, together with forces created by the turbulence of the moving air, causes the glottis to open and close at a periodic rate. Depending on the physical construction of the vocal tract, these oscillations occur between 50 and 500 times per second. The oscillatory sound waves are then modified as they travel through the throat, over the tongue, through the mouth and over the teeth and lips.

    1.1.2 Speech Synthesis Model

    The model assumes that the speech signal is produced by a buzzer at the end of a tube. The glottis produces the buzz, which is characterized by intensity (loudness) and frequency (pitch). The vocal tract (throat and mouth) is modeled by a system of connected lossless tubes.

    Figure 1 shows the GSM vocoder speech synthesis model. A sequence of pulses is combined with the output of a long-term pitch filter. Together, they model the buzz produced by the glottis and form the excitation for the final speech synthesis filter, which in turn models the throat and mouth as a system of lossless tubes.

    The initial sequence of so-called residual pulses is constructed by assembling predefined pulse waveforms taken out of a given, fixed codebook. The codebook contains a selection of so-called fixed code vectors, which are basically fixed pulse sequences of varying frequency. In addition, the pulse intensities are scaled by applying a variable gain factor.

    The output of the long-term pitch filter is simply a previous excitation sequence, modified by scaling it with a variable gain factor. The amount by which excitations are delayed in the pitch filter is a parameter of the speech synthesis model and can vary over time. The long-term pitch filter is also referred to as the adaptive codebook, since the history of all past excitations basically forms a codebook with varying contents, out of which one past excitation sequence, the so-called adaptive code vector, is chosen.

    Finally, the excitation, which is constructed by adding the fixed and adaptive codebook vectors, is passed through the short-term speech synthesis filter, which simulates a system of connected lossless tubes. Technically, the short-term filter is a tenth-order linear prediction filter, meaning that its output is a linear combination (linear weighted sum) of ten previous inputs. The ten linear prediction coefficients are intended to model the reflections and resonances of the human vocal tract.
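As a rough illustration of the synthesis step just described, the C sketch below builds the excitation from scaled fixed and adaptive code vectors and runs it through a tenth-order all-pole (direct-form) filter. The subframe length, function and variable names, and the use of floating point are illustrative assumptions only; the actual EFR reference implementation operates on 16-bit fixed-point data with its own naming and scaling.

```c
#define SUBFRAME 40   /* samples per subframe (illustrative) */
#define LP_ORDER 10   /* tenth-order short-term synthesis filter */

/* Build the excitation from fixed and adaptive code vectors, then run it
 * through the linear prediction synthesis filter:
 *   s[n] = exc[n] - sum_{i=1..10} a[i] * s[n-i]
 * mem[] holds the last LP_ORDER synthesized samples (filter state),
 * oldest first.
 */
static void synthesize(const double a[LP_ORDER + 1],        /* a[0] = 1.0 */
                       const double fixed_cv[SUBFRAME], double fixed_gain,
                       const double adaptive_cv[SUBFRAME], double adaptive_gain,
                       double mem[LP_ORDER], double speech[SUBFRAME])
{
    for (int n = 0; n < SUBFRAME; n++) {
        /* excitation = scaled fixed codevector + scaled adaptive codevector */
        double exc = fixed_gain * fixed_cv[n] + adaptive_gain * adaptive_cv[n];

        /* subtract the weighted sum of the ten previous outputs */
        double acc = exc;
        for (int i = 1; i <= LP_ORDER; i++) {
            double past = (n - i >= 0) ? speech[n - i]
                                       : mem[LP_ORDER + (n - i)];
            acc -= a[i] * past;
        }
        speech[n] = acc;
    }

    /* update filter memory with the last LP_ORDER outputs */
    for (int i = 0; i < LP_ORDER; i++)
        mem[i] = speech[SUBFRAME - LP_ORDER + i];
}
```

With all gains and coefficients set appropriately, this is the same computation the decoder performs once per subframe to reconstruct speech from the transmitted parameters.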

    1.1.3 Speech Encoding and Decoding

    Instead of transmitting compressed speech samples directly, the input speech samples are analyzed in order to extract the parameters of the speech synthesis model, which are then transmitted to the receiving side, where they are in turn used to synthesize the reconstructed speech.

    On the encoding side, the input speech is analyzed to estimate the coefficients of the linear prediction filter, removing their effects and estimating the intensity and frequency. The process of removing the linear prediction effects is performed by inverse filtering of the incoming speech. The remaining signal, called the residual, is then used to estimate the pitch filter parameters. Finally, the pitch filter contribution is removed in order to find the closest matching residual pulse sequence in the fixed codebook.

    At the receiver, the transmitted parameters are decoded, combining the selected fixed and adaptive code vectors to build the short-term excitation. The linear prediction coefficients are decoded and the speech is synthesized by passing the excitation through the parameterized short-term filter.

    Altogether, this speech synthesis method has the advantage of achieving a high compression ratio, since it tries to transmit only the actual information inherent in the speech signal. All the redundant relationships, which are due to the way the human vocal tract is organized, are captured by the filters of the speech


    [Figure 2: SpecC methodology. The synthesis flow refines the specification model via architecture exploration (behavior, channel and variable partitioning) into an architecture model, then via communication synthesis (protocol selection, transducer synthesis, protocol inlining) into a communication model, and finally via the backend (compilation, high-level synthesis, interface synthesis) into an implementation model ready for manufacturing; IP components can enter at each stage. The accompanying analysis and validation flow derives a simulation model at each stage for estimation and for validation of algorithm and functionality, of functionality and synchronization, of functionality and performance, and of performance and timing.]

    synthesis model. The vocal tract model provides an accurate simulation of the real world and is quite effective in synthesizing high-quality speech. In addition, encoding and decoding are relatively efficient to compute.

    1.2 System-Level Design

    1.2.1 SpecC Methodology

    The system-level design methodology developed at UC Irvine is shown in Figure 2 [3, 5]. The methodology starts with an executable specification. This specification describes the functionality as well as the performance, power, cost and other constraints of the intended design. It does not make any presumptions regarding the implementation details.

    As shown in Figure 2, the synthesis flow of the codesign process consists of a series of well-defined design steps which eventually map the executable specification to the target architecture. In this methodology, we distinguish two major system-level tasks, namely architecture exploration and communication synthesis.

    Architecture exploration includes the design steps of allocation and partitioning of behaviors, channels and variables. Allocation determines the number and the types of the system components, such as processors, ASICs and busses, which will be used to implement the system behavior. Allocation includes the reuse of intellectual property (IP), when IP components are selected from the component library.

    Behavior partitioning distributes the behaviors (or processes) that comprise the system functionality amongst the allocated processing elements, whereas variable partitioning assigns variables to memories and channel partitioning assigns communication channels to busses. Scheduling determines the order of execution of the behaviors assigned to the processors.

    Architecture exploration is an iterative process whose final result is the definition of the system architecture. In each iteration, estimators are used to evaluate whether the design constraints are satisfied. As long as any constraints are not met, component and connectivity reallocation is performed, and a new architecture with different components, connectivity, partitions, schedules or protocols is evaluated.

    After the architecture model is defined, communication synthesis is performed in order to obtain a design model with refined communication. The task of communication synthesis includes the selection of communication protocols, synthesis of interfaces and transducers, and inlining of protocols into synthesizable components. Thus, communication synthesis refines the abstract communication between behaviors into an implementation.

    It should be noted that the design decisions in each of these tasks can be made manually by the designer, e.g. by using an interactive graphical user interface, as well as by automatic synthesis tools.

    The result of the synthesis flow is handed off to the backend tools, shown in the lower part of Figure 2. Code for the software and hardware parts is generated automatically. The software part of the hand-off model consists of C code, and the hardware part consists of behavioral VHDL or C code. The backend tools include compilers and a high-level synthesis tool. The compilers are used to compile the software C code for the processor onto which the code is mapped. The high-level synthesis tool is used to synthesize the functionality mapped to custom hardware and the interfaces needed to connect different processors, memories and IPs.

    During each design step, the model is statically analyzed to estimate certain quality metrics such as performance, cost and power consumption. This design model is also used in simulation to verify the correctness of the design at the corresponding step.


    For example, at the specification stage, the simulation model is used to verify the functional correctness of the intended design. After architecture exploration, the simulation model is used to verify the synchronization between behaviors on different processing elements (PEs). After communication synthesis, the simulation model is used to verify the performance of the system, including computation and communication.

    1.2.2 SpecC Language

    The methodology described in the previous section is supported by a new system-level description and specification language called SpecC [1, 2, 4], which was developed at UC Irvine in the realization that existing languages lack many of the features needed for system-level design [3]. At all stages of the SpecC methodology, the current state of the design is represented by a model described in the SpecC language. In this homogeneous approach, transformations are made on the SpecC description, in contrast to a heterogeneous approach where each step also transforms the design into a new language, ending up with a mix of design representations at different stages of the process.

    SpecC is built as a superset of ANSI-C [6], which allows easy reuse of the existing algorithmic and behavioral C descriptions that are common in today's industrial practice. SpecC contains all of the features required to support system-level design, including IP integration in general and the SpecC methodology in particular:

    - Structural and behavioral hierarchy.

    - Concurrency.

    - Communication, with explicit separation of computation (behaviors) from communication (channels).

    - Synchronization.

    - Exception handling (traps and interrupts).

    - Timing.

    - Explicit state transitions (FSM modeling).

    All these features are explicitly specified in a clear and orthogonal manner, which makes it easy to understand and analyze given SpecC descriptions, both for humans and for automation tools. This is an essential requirement to enable successful design automation and synthesis of high-quality results.

    SpecC descriptions are translated into a C++ model by the SpecC compiler. These C++ descriptions are in turn compiled into a native executable for simulation and verification. This results in very high simulation speeds because the design is compiled instead of interpreted.

    1.3 Overview

    The rest of the report is organized as follows: Section 2 describes the specification of the vocoder standard in SpecC, including a more detailed description of the functionality and other requirements. In Section 3 the different steps performed during architectural exploration are shown. Starting with a general overview, the final vocoder architecture is developed. The process of mapping the abstract communication onto real protocols, busses, etc. is described in Section 4. Communication synthesis also shows the requirements for successful integration of intellectual property (IP). Again, a general discussion is followed by the vocoder specifics. Finally, Section 5 concentrates on the synthesis of the system's software and hardware parts in the backend. Section 6 concludes the report with a summary.

    2 Specification

    2.1 General

    The first step in any design process is the specification of the system requirements. This includes functionality as well as other requirements like timing, power consumption or area.

    2.1.1 Formal, Executable Specification

    As outlined in the introduction, in the SpecC system the specification is formally captured and written in the SpecC language. As opposed to the informal specifications (e.g. in plain English) that have been commonly used in the past, a formal specification of the system has two main advantages:

    1. It is executable, for simulation and verification of the desired functionality or for feasibility studies at an early stage.

    2. The formal, executable specification directly serves as an input to the following synthesis and exploration stages that eventually lead to the final implementation, without the need for time-consuming modifications or translations into other languages and models.


    2.1.2 Modeling Guidelines

    The initial SpecC specification should model the system at a very abstract level, without introducing unnecessary implementation details. In addition, a good synthesis result using automated tools requires the user to follow certain modeling guidelines when developing the initial specification. Basically, the specification should capture the required system functionality in a natural way and in a clear and concise manner. Specifically, some of the main guidelines for developing the initial system specification are:

    - Separating communication and computation. Algorithmic functionality has to be detached from communication functionality. In addition, inputs and outputs of a computation have to be explicitly specified to show data dependencies.

    - Exposing the parallelism inherent in the system functionality instead of artificially serializing behaviors in anticipation of a serial implementation. In essence, all parallelism should be made available to the exploration tools in order to increase the room for optimizations.

    - Using hierarchy to group related functionality and abstract away localized effects at higher levels. For example, local communication and local data dependencies are grouped and hidden by the hierarchical structure.

    - Choosing the granularity of the basic parts for exploration such that optimization possibilities and design complexity are balanced when searching the design space. Basically, the leaf behaviors, which build the smallest indivisible units for exploration, should reflect the division into basic algorithmic blocks.

    - Using state transitions to explicitly model the steps of the computation in terms of basic algorithms or abstracted, hierarchical blocks.

    The SpecC language provides all the necessary support to efficiently describe the desired system features following these guidelines. Each of the modeling concepts, like parallelism or hierarchy, is reflected in the SpecC description in an explicit and clear way.

    2.2 Vocoder Specification

    The GSM 06.60 standard for the EFR vocoder contains a detailed description of the required vocoder functionality [9]. The standard description was translated into a formal, executable SpecC specification

    Figure 3: Vocoder (parallel coder and decoder: speech in / bits out, bits in / speech out).

    building the basis for the following synthesis and design steps. In addition, the specification was simulated to verify the vocoder functionality.

    The SpecC specification was developed following the guidelines mentioned in the previous section. In our case, part of the vocoder standard is a complete implementation of the vocoder functionality in C (see Appendix A). The C code is based on a 16-bit fixed-point implementation of the algorithms and it serves as a bit-exact reference for all implementations of the vocoder standard, i.e. the C code specifies the vocoder functionality down to the bit level.

    Therefore, for the vocoder, the C reference implementation builds the basis for the specification in SpecC. However, a great amount of time had to be spent on analyzing and understanding the standard, including the 13,000 lines of C code, in order to extract the high-level structure and the global interdependencies. Once this was done, a mapping into a SpecC representation was straightforward. Using the guidelines of Section 2.1.2, the high-level picture of the vocoder's abstracted functionality could be directly and naturally reflected in its SpecC specification.

    As will be seen in the following sections, this greatly eases understanding of the vocoder basics and therefore supports quick exploration of different design alternatives at the system level in the first place. At each level, the SpecC description hides unnecessary details but explicitly depicts the major aspects, focusing the view of the user and the tools onto the important decisions at each step. For example, at the lowest level, detailed algorithmic code is hidden in the leaf behaviors whereas the relations between the behaviors are made explicit through state transitions.

    In terms of the actual algorithmic behavior, SpecC being built on top of ANSI-C made it possible to directly plug the C code of each basic function in the C description into the corresponding leaf behavior of the SpecC specification.


    Figure 4: Coder (pre_process, code_12k2, prm2bits_12k2, operating on speech[160], prm[57] and bits; code_12k2 comprises linear prediction (LP) analysis, open-loop pitch search, closed-loop pitch search, algebraic (fixed) codebook search, and filter memory update).

    2.2.1 Overview

    Figure 3 shows the top level of the vocoder specification in SpecC, consisting of independent coding and decoding subbehaviors running in parallel. The coder receives an input stream of 13-bit wide speech samples at a sampling rate of 8 kHz, corresponding to an input bit rate of 104 kbit/s. It produces an output bit stream of encoded parameters with a bit rate of 12.2 kbit/s. Decoding, on the other hand, is the reverse process of synthesizing a reconstructed stream of speech samples from an input parameter bit stream. The following sections will describe the encoding and decoding processes in more detail insofar as they are relevant for the following discussions about the vocoder design process. See Appendix B for an in-depth description of the vocoder functionality in SpecC.

    2.2.2 Coder Functionality

    Coding is based on a segmentation of the incoming speech into frames of 160 samples, corresponding to 20 ms of speech. Speech parameters are extracted on a frame-by-frame basis. For each speech frame the coder produces a frame of 244 encoded bits, resulting in the aforementioned output bit rate of 12.2 kbit/s.
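    The frame and sample counts above determine both bit rates by simple arithmetic. The small check below uses only the constants quoted from the standard; the helper function names are ours:

```c
/* Constants quoted from the GSM EFR standard in the text above;
 * the helper functions are illustrative only. */
enum {
    SAMPLES_PER_FRAME    = 160,  /* one speech frame              */
    BITS_PER_SAMPLE      = 13,   /* linear PCM input samples      */
    SAMPLE_RATE_HZ       = 8000,
    BITS_PER_CODED_FRAME = 244   /* encoded bits per speech frame */
};

/* Input bit rate in bit/s: 13-bit samples at 8 kHz. */
static int input_bitrate_bps(void)  { return BITS_PER_SAMPLE * SAMPLE_RATE_HZ; }

/* Frame period in ms: 160 samples at 8 kHz. */
static int frame_period_ms(void)    { return SAMPLES_PER_FRAME * 1000 / SAMPLE_RATE_HZ; }

/* Output bit rate in bit/s: 244 bits per frame period. */
static int output_bitrate_bps(void) { return BITS_PER_CODED_FRAME * 1000 / frame_period_ms(); }
```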

    Figure 4 shows the first two levels of the coder hierarchy. At the top level, the coder consists of three main parts which execute in a pipelined fashion:

    1. Pre-processing: Buffering of the incoming speech stream into frames of 160 samples. Initial high-pass filtering and downscaling of the speech signal.

    2. Encoding: The main encoding routine, which extracts a set of 57 speech synthesis parameters per speech frame. Encoding will be described in more detail in the following paragraphs.

    3. Serialization: Conversion and encoding of the parameter set into a block of 244 bits per frame. Transmission of the encoded bit stream.

    Due to the pipelined nature, all three parts operate in parallel but each on a different frame, i.e. while a frame is encoded, the next frame is pre-processed and buffered, and the previous frame is serialized.

    In general, encoding uses an analysis-by-synthesis approach where the parameters are selected in such a way as to minimize the error between the input speech and the speech that will be synthesized on the decoding side. Therefore, the encoder has to simulate the speech synthesis process.

    In order to increase the reaction time of certain filters, the main encoding routine (also shown in Figure 4) further subdivides each frame into subframes of 40 samples (or 5 ms) each. Depending on their criticality, parameters are computed either once per frame, once every two subframes or once per subframe.

    Encoding    Encoding itself basically follows the reverse process of speech synthesis. Given the speech samples, in a first step the parameters of the LP filter are extracted. The contribution of the LP filter is then subtracted from the input speech to get the remaining LP filter excitation. The LP filter parameters are encoded as so-called Line Spectral Pairs (LSPs), which reduce the amount of redundant information. Two sets of LP parameters are extracted per frame, taking into account the current speech frame plus one half of the previous frame. LP analysis produces a block of 5 parameters containing the two LSP sets.
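    At the heart of LP analysis is the Levinson-Durbin recursion, which turns the autocorrelation sequence of the windowed speech into the LP filter coefficients. The floating-point sketch below illustrates only the recursion itself; the reference code of the standard uses a bit-exact 16-bit fixed-point version of it (the GSM LP order is 10).

```c
#define LP_ORDER 10  /* LP filter order used by the GSM EFR coder */

/* Levinson-Durbin recursion (floating-point sketch): given the
 * autocorrelation values r[0..order], compute the coefficients
 * a[1..order] of A(z) = 1 + a[1]z^-1 + ... + a[order]z^-order
 * (a[0] is set to 1).  Returns the final prediction error energy.
 * Assumes order <= LP_ORDER. */
static double levinson_durbin(const double *r, double *a, int order)
{
    double tmp[LP_ORDER + 1];
    double err = r[0];

    a[0] = 1.0;
    for (int i = 1; i <= order; i++) {
        /* Reflection coefficient from the current prediction error. */
        double acc = r[i];
        for (int j = 1; j < i; j++)
            acc += a[j] * r[i - j];
        double k = -acc / err;

        /* Update a[1..i-1] and append the new coefficient a[i]. */
        for (int j = 1; j < i; j++)
            tmp[j] = a[j] + k * a[i - j];
        for (int j = 1; j < i; j++)
            a[j] = tmp[j];
        a[i] = k;

        err *= 1.0 - k * k;  /* the prediction error shrinks each step */
    }
    return err;
}
```

    For an AR(1) signal with autocorrelation r[k] = 0.5^k, the recursion yields a[1] = -0.5 and a vanishing second coefficient, as expected.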

    Next, using the past history of excitations, all the possible delay values of the pitch filter are searched for the closest match with the required excitation. The search is divided into an open-loop and a closed-loop search. A simple open-loop calculation of delay estimates is done twice per frame. In each subframe a closed-loop, analysis-by-synthesis search is then performed around the previously obtained estimates to obtain the exact filter delay and gain values.
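    A minimal sketch of the open-loop stage, assuming a single exhaustive scan over the lag range (the actual standard splits the range into sections and favors smaller lags; the function and parameter names here are ours):

```c
/* Open-loop pitch estimation sketch: pick the lag t in
 * [min_lag, max_lag] maximizing the normalized correlation
 * corr(t)^2 / energy(t) between the (weighted) speech s[0..n-1]
 * and its copy delayed by t samples.  The caller must provide
 * max_lag samples of history before s[0], i.e. s points into a
 * longer buffer. */
static int open_loop_pitch(const double *s, int n, int min_lag, int max_lag)
{
    int best_lag = min_lag;
    double best_score = -1.0;

    for (int t = min_lag; t <= max_lag; t++) {
        double corr = 0.0, energy = 1e-12;  /* guard against /0 */
        for (int i = 0; i < n; i++) {
            corr   += s[i] * s[i - t];
            energy += s[i - t] * s[i - t];
        }
        double score = corr * corr / energy;
        if (score > best_score) {
            best_score = score;
            best_lag = t;
        }
    }
    return best_lag;
}
```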

    The long-term filter contribution is subtracted from the excitation. The remaining residual is the input to the following fixed codebook search. For each subframe an extensive search of the codebook for the closest code vector is performed. All possible code vectors


    Figure 5: Encoding (code_12k2: LP_analysis with windowing & autocorrelation, Levinson-Durbin, A(z) -> LSP, LSP quantization, and interpolation; Open_loop with weighted speech and open-loop pitch delay search; Closed_loop with target signal, impulse response, pitch delay and pitch gain; Codebook with codebook search, prefiltering, and codebook gain; Update with speech synthesis, gain quantization, and filter memory update).

    are searched such that the mean square error between the code vector and the residual is minimized.

    For each subframe the coder produces a block of 13 parameters for transmission. Finally, using the calculated parameters, the reconstructed speech is synthesized in order to update the memories of the speech synthesis filters, reproducing the conditions that will be in effect at the decoding side.
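    The selection criterion can be made concrete: for a candidate code vector c, the gain g = &lt;x,c&gt;/&lt;c,c&gt; minimizes the error ||x - g*c||^2 against the target residual x, so minimizing the MSE is equivalent to maximizing &lt;x,c&gt;^2/&lt;c,c&gt;. The exhaustive search below uses this identity; it is only an illustration, since the real algebraic codebook is searched by nested loops over pulse positions rather than by enumerating vectors:

```c
/* Exhaustive codebook search sketch: return the index of the code
 * vector (rows of the flat array cb, each of length n) minimizing
 * the mean square error against the target x when each vector is
 * scaled by its optimal gain.  That gain is returned via *gain. */
static int codebook_search(const double *cb, int num_vectors, int n,
                           const double *x, double *gain)
{
    int best = 0;
    double best_score = -1.0, best_gain = 0.0;

    for (int v = 0; v < num_vectors; v++) {
        const double *c = cb + v * n;
        double corr = 0.0, energy = 1e-12;
        for (int i = 0; i < n; i++) {
            corr   += x[i] * c[i];
            energy += c[i] * c[i];
        }
        double score = corr * corr / energy;  /* max <=> min MSE */
        if (score > best_score) {
            best_score = score;
            best_gain  = corr / energy;       /* optimal gain */
            best       = v;
        }
    }
    *gain = best_gain;
    return best;
}
```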

    Figure 5 exposes the next level of hierarchy in the encoding part, showing more details of the encoding process. Note that for simplicity only the behavioral hierarchy and no structural information is shown, i.e. the diagram doesn't include the information about connectivity between behaviors. A complete block diagram of the coder, which provides an idea about the complexity by exposing all levels of hierarchy down to the leaf behaviors, can be found in Appendix B on page 66 (Figure 64).

    As can be seen, at this level the coder specification exhibits some limited explicit parallelism. However, in general, due to the inherent data dependencies, both the coder and decoder parts of the system are mostly sequential in nature.

    Figure 6: Decoder (Bits2prm_12k2, decode_12k2 with LP parameter decoding and four subframe decodes, and Post_Filter; data flows bits, prm[57], prm[13], A(z), synth[40], speech[160]).

    2.2.3 Decoder Functionality

    Decoding (Figure 6) basically follows the speech synthesis model in a straightforward way and is more or less the reverse process of encoding. The decoder receives an encoded bit stream at a rate of 12.2 kbit/s and reproduces a stream of synthesized speech samples at a sampling rate of 8 kHz. For each incoming frame of 244 encoded bits a frame of 160 speech samples is generated.

    Incoming bit frames are received and the corresponding set of 5 + 4 x 13 = 57 speech parameters is reconstructed. The first 5 parameters, containing the Line Spectral Pairs, are decoded to generate the two sets of LP filter parameters. Then, once for each subframe, the following blocks of 13 parameters each are consumed and decoded, and the speech subframe of 40 samples is synthesized by adding the long-term pitch filter output to the decoded fixed code vector and filtering the resulting excitation through the short-term LP filter. Finally, the synthesized speech is passed through a post filter in order to increase speech quality.

    A more detailed block diagram of the decoder, showing all levels of hierarchy down to the leaf behaviors, can be found in Appendix B, Figure 69 on page 71. Compared to the encoding process, decoding is much simpler and computationally much cheaper.
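    The short-term synthesis step mentioned above, filtering the excitation through 1/A(z), can be sketched as follows (floating-point, with a function signature of our own; the reference code is 16-bit fixed-point):

```c
/* Short-term LP synthesis filter sketch: given A(z) coefficients
 * a[1..order] (a[0] = 1 implied), compute
 *   synth[i] = exc[i] - sum_{j=1..order} a[j] * synth[i-j],
 * using mem[0..order-1] as the filter state (most recent past
 * output first).  The state is updated for the next subframe;
 * assumes n >= order. */
static void lp_synthesis(const double *a, int order,
                         const double *exc, double *synth, int n,
                         double *mem)
{
    for (int i = 0; i < n; i++) {
        double s = exc[i];
        for (int j = 1; j <= order; j++) {
            double past = (i - j >= 0) ? synth[i - j] : mem[j - i - 1];
            s -= a[j] * past;
        }
        synth[i] = s;
    }
    /* Save the last 'order' outputs as state for the next call. */
    for (int j = 0; j < order; j++)
        mem[j] = synth[n - 1 - j];
}
```

    With a single coefficient a[1] = -0.5, an impulse excitation produces the decaying sequence 1, 0.5, 0.25, 0.125, ... as expected of a one-pole synthesis filter.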

    2.2.4 Constraints

    Transcoder Delay    The GSM vocoder standard specifies a constraint for the total transcoder delay when operating coder and decoder in back-to-back mode. According to the standard, back-to-back mode is defined as passing the parameters produced by the encoder directly into the decoder as soon as they are produced. Note that this definition doesn't include encoding and decoding, parallel/serial conversions, or transmission times of the encoded bit stream. Back-to-back mode is not considered as the connection of the coder output with the decoder input. Instead, the 57 parameters produced by the encoder are assumed to be passed directly into the decoder inside the vocoder system.

    The transcoder delay is then defined as the delay starting from the time when a complete speech frame of 160 samples is received up to the point when the last speech sample of the reconstructed, synthesized frame leaves the decoder. The GSM EFR vocoder standard specifies a maximum timing constraint of 30 ms for this transcoder delay.

    Analysis and Budgeting    In addition to the explicitly given transcoder delay constraint, the requirements on the input and output data rates pose additional constraints on the vocoder timing. All requirements of the standard were analyzed to derive timing budgets for different parts of the vocoder, resulting in the actual constraints of the SpecC description.

    Figure 7 depicts an analysis of the transcoder delay constraint. Note that the time difference between the first and the last sample of synthesized speech

    Figure 7: Timing constraints (input frame rate 20 ms, output speech frame 20 ms, transcoder delay <= 30 ms, coder/decoder delay for the first subframe <= 10 ms).

    at the decoder output is 20 ms (with the given sampling rate). Therefore, even if encoding and decoding happened instantaneously, in zero time, the theoretically achievable minimum for the transcoder delay would be 20 ms, too. In other words, the first sample of reconstructed speech has to leave the decoder not more than 10 ms after the input speech frame is received.

    Hence, encoding and decoding of the first subframe of 40 speech samples has to happen in less than 10 ms. This includes all the information needed for the first subframe, i.e. encoding and decoding of the 5 LP filter parameters plus the set of 13 parameters for the first subframe. Then, while the speech samples are written to the decoder output at their sampling


    rate, the following three subframes have to be encoded into blocks of 13 parameters and decoded into reconstructed speech subframes such that the following subframes are available at intervals of at most 5 ms.

    However, while encoding and decoding of the current frame take place, the next frame is already being received and buffered, and processing of the next frame will have to start once its last sample is received. Therefore, an additional, implicit constraint is that encoding and decoding of a complete frame of 160 samples have to be done in less than the frame period of 20 ms. Hence, decoding of the last subframe will have to be done before that time, or, in relation to the transcoder delay constraint, up to 10 ms before the last sample of the synthesized speech frame leaves the decoder output. Note that this requires a buffering of the decoded speech subframes at the decoder output.

    To summarize the constraints for the vocoder, there are two basic timing constraints derived from the given time budgets:

    - The encoding and decoding delay for the first subframe (the 5 LP parameters plus the first subframe's 13 parameters) has to be less than 10 ms.

    - The time to encode and decode a complete frame (all 57 parameters) has to be less than 20 ms.
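    Both budgets follow from the standard's numbers alone; the small check below spells out the derivation (the constant and function names are ours):

```c
/* Derivation of the two timing budgets from the standard's numbers. */
enum {
    TRANSCODER_DELAY_MS = 30,  /* GSM EFR back-to-back constraint */
    FRAME_MS            = 20,  /* 160 samples at 8 kHz            */
    SUBFRAME_MS         = 5    /* 40 samples at 8 kHz             */
};

/* A synthesized frame spans FRAME_MS at the output, so even a
 * zero-time codec incurs FRAME_MS of transcoder delay; the slack
 * left for processing the first subframe is the difference. */
static int first_subframe_budget_ms(void)
{
    return TRANSCODER_DELAY_MS - FRAME_MS;
}

/* The next frame finishes arriving one frame period after the
 * current one, so a complete frame must be processed within it. */
static int full_frame_budget_ms(void)
{
    return FRAME_MS;
}
```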

    3 Architectural Exploration

    The goal of architectural exploration is initially to quickly explore a large number of target architectures, comparing them to each other after an initial mapping of the design onto the architecture has been done, and finally pruning the design space down to a few candidate architectures. These system architectures are then evaluated further, trying to improve the mapping of the design onto the architecture and possibly modifying certain aspects of the system until a final architecture is selected.

    Exploration is a part of the design process that can benefit greatly from human experience. Therefore, interactivity is an important requirement for system-level design environments. However, with the help of automated tools that quickly search large parts of the design space, provide feedback about design quality, perform tedious, time-consuming jobs automatically, etc., the designer will be able to explore a large number of promising design alternatives in a shorter amount of time. In general, exploration is an iterative process where the different steps described in the next sections are repeated for different architectures. Design automation tools significantly reduce the time needed for each iteration.

    Figure 8: General specification model.

    Since the corresponding tools are not yet available at this time, exploration for the vocoder project had to be done mostly manually. However, manual exploration strictly followed the flow and the step-by-step procedures proposed for the implementation of the automated tools. Nevertheless, due to the lack of tools we had to restrict ourselves to a very small number of candidate architectures.

    3.1 Models

    3.1.1 Specification Model

    The initial specification written by the designer is the basis for architectural exploration. The specification model at the input of architectural exploration is shown in Figure 8. The specification model is a superstate finite state machine (SFSM) with hierarchy and concurrency.

    At each level of hierarchy, the superstates (SpecC behaviors) are decomposed further into either parallel or sequential substates. Superstates at the bottom of the hierarchy are called leaf states (or leaf behaviors). They finally contain actual program fragments describing the algorithmic behavior of the leaf states.

    In a sequential composition of states at any level of hierarchy, the states are traversed in a stepwise fashion. After flattening of the hierarchy, the model resembles a standard state machine with the additional feature of parallelism. In contrast to a classical low-level FSM or FSMD, however, superstates can take an arbitrary


    Figure 9: General model for architectural exploration (DSP core, CPU core, ASICs, and an MPEG IP connected via a DSP bus and a system bus).

    amount of time to execute the statements or substates contained within. Much like in a dataflow model, a state doesn't start to execute until all of its predecessors are finished. In addition, control flow is introduced by the possibility to augment transitions with conditions.
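    The execution semantics just described (a superstate runs to completion, then a condition-dependent transition selects the successor) can be sketched as follows; the types, names, and the tiny demo machine are ours:

```c
/* Superstate FSM sketch: each state runs an action that may take
 * arbitrary time, then a conditional transition evaluated on a
 * shared context picks the next state.  Illustrative only. */
typedef struct {
    void (*action)(int *ctx);      /* superstate body                 */
    int  (*next)(const int *ctx);  /* successor index, -1 terminates  */
} state_t;

/* Run the machine from state 'start' until a transition returns -1;
 * returns the final context value. */
static int run_sfsm(const state_t *states, int start, int ctx0)
{
    int ctx = ctx0;
    for (int s = start; s >= 0; ) {
        states[s].action(&ctx);
        s = states[s].next(&ctx);
    }
    return ctx;
}

/* Demo machine: state 0 increments the context until it reaches 3
 * (a transition condition), then state 1 scales it and stops. */
static void inc_action(int *ctx)   { (*ctx)++; }
static int  inc_next(const int *c) { return *c < 3 ? 0 : 1; }
static void mul_action(int *ctx)   { *ctx *= 10; }
static int  mul_next(const int *c) { (void)c; return -1; }

static const state_t demo[2] = {
    { inc_action, inc_next },
    { mul_action, mul_next }
};
```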

    3.1.2 Architecture Model

    Figure 9 shows the general target or output model of architectural exploration. An architecture consists of a set of processing, memory and communication components. Processors and memories are connected together by communication links, forming a bipartite graph. During exploration, given the initial specification, behaviors are mapped to processors and communication is mapped to the network in the architecture. The architecture can contain multiple instances of the same component.

    Components are taken out of a library or database of available component types. The library contains the functional models for simulation together with information for exploration (e.g. about speed or cost).

    In general, the functionality of the library components ranges from fully customizable components, which can implement any behavior and any interface, to fixed components with predefined behaviors and interfaces. Flexibility in the component functionality is exploited by synthesizing customized versions of the components during the design process. If part or all of the component functionality is fixed, the components are also referred to as intellectual property (IP) components.

    Processing Components    In general, processing elements (PEs) can be arbitrary programmable software processor cores or non-programmable hardware PEs. Examples of software processors are digital signal processing (DSP) cores or general purpose (GP) CPU cores. Examples of hardware processors are fully synthesized application specific integrated circuit (ASIC) components or IP components with fixed, predefined functionality. Processors are considered to have a single thread of control. If a processor can execute behaviors in parallel, it is split into logical single-threaded processing components for exploration. For example, an architecture with multiple hardware processors will have several custom-hardware FSMDs, although all logical HW processors or FSMDs might end up on the same physical chip.

    Processing Components are fully synthesized or taken out of an IP library. In general, IP processing components are hardware modules with all or parts of their functionality fixed, e.g. an MPEG hardware component, or a software processor which can be programmed to implement a large range of behaviors but nevertheless has a fixed, predefined interface. Processing components in the library are characterized by their functionality and parameters like cost, power, speed, etc.

    Memory Components    In addition to the pure functionality, system-level processing components generally include some sort of local memory. Along with the behaviors, variables and storage in general are mapped onto the local memories of the processors. A special case of system components are (shared) memories. The functionality of memory components is limited to simple reading and writing of their local memories. Global variables in the specification which are shared by two behaviors mapped to two different processors can either be mapped to the local memories of the processors or to a shared memory, in case such a component is included in the system architecture.

    Communication Components    The communication components of the interconnect network handle the communication between the processing and memory components. Therefore, they include any type of communication media used to implement system-level communication. For example, busses are commonly used for interconnection at the system level.

    Similar to processing components, communication components are taken out of a library of available interconnect types characterized by their functionality (protocol) and parameters like delay, throughput, etc.


    Figure 10: Architectural exploration flow (simulation/profiling, allocation, estimation, partitioning, re-estimation, and scheduling, using component and IP libraries; output: architecture for implementation).

    Protocols in the library are usually at least partly predefined, implementing industry standards like PCI or VME. On the other hand, protocols can be customized, or full-custom protocols can be synthesized during the design process.

    3.2 Exploration Flow

    The general flow of steps performed during architectural exploration is shown in Figure 10. With the design specification at the input, exploration creates an architecture for implementation of this design.

    Initially, during simulation, the specification is profiled to extract estimates about the design complexity and the dynamic behavior. Using these implementation-independent estimates, a set of components is selected out of a library during allocation. Allocation is based on matching components with the computational requirements and the available parallelism of the specification.

    Once a set of components has been selected, design metrics like cost, delay, etc. of implementing the parts of the specification on these components are estimated, taking into account information obtained during initial profiling. In the next step, partitioning is performed to map the design onto the components based

    on these component-related estimates. During partitioning, design trade-offs related to the implementation of behaviors on different components and parallel versus sequential execution of behaviors are explored.

    After the design has been mapped onto the allocated components, the design space is reduced to a single implementation of the design parts on the components they are bound to. A more accurate re-estimation of this implementation is then performed. Finally, the design is scheduled with this information to derive the actual system timing. Constraints are verified, and depending on the severity of the violations, a new iteration of the exploration loop is started with reallocation or repartitioning until an optimal architecture has been found.

    3.3 Analysis and Estimation

    3.3.1 General Discussion

    The basis for any exploration of the design space, including system-level architectures, is the availability of good and useful design quality metrics. On the one hand, metrics are closely related to the design constraints like performance, power or cost (area, code size, etc.). On the other hand, other metrics can provide additional useful information. Basically, these metrics are the only means of deciding how good a chosen architecture is in comparison with other possible architectures. Therefore, analysis of the specification and estimation of the design metrics for different implementations is an integral part of the design process.

    Behavior Estimation    During architectural exploration, behaviors are mapped to the processing components of the architecture. For each type of target processor the behaviors will exhibit different metrics, i.e. different cost, different performance, etc. Therefore, by combining the properties of the specification with the abstracted properties of the target components, estimates about the behavior metrics are obtained without actually implementing the behavior on the component.

    However, different implementations of a behavior on the same processor, resulting in different metrics, have to be considered during architectural exploration, too. For example, the behavior can be optimized for cost, resulting in the least-cost solution. At the other end of the spectrum, a behavior can be optimized for performance to achieve the fastest execution possible. Based on the estimated values, the exploration tools will assign budgets for cost or performance, for example, to behaviors or groups of behaviors. This information is then passed to the backend, where it is used to guide the synthesis process.

    Communication Estimation    In general, the overhead of communicating data values between the behaviors of the specification cannot be neglected when evaluating possible target architectures. For behaviors mapped to the same component, communication between those behaviors is handled according to the component's communication model and is therefore included in the behavior estimation. For example, on a software processor the call overhead of pushing/popping values to/from the stack is part of the software implementation of the caller and callee on the processor.

    On the other hand, communication among behaviors mapped to different processing components requires transferring data over the channels in the system architecture. Estimates for the overhead of this communication are taken into account during exploration. The main communication overhead is due to the delay of transmitting values from one component to the other.

    Basic estimates of the time needed for communication are obtained by dividing the size of the data block to be transmitted by the channel data rate. More elaborate estimates are obtained by considering the protocol overhead of the channel, possibly including the segmentation of the data block into smaller packets for transmission.
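    Both levels of estimate can be written down directly; the packetized variant below assumes a fixed per-packet protocol overhead, a simple model of our own rather than any specific bus protocol:

```c
/* Basic estimate: transfer time = block size / channel data rate. */
static double comm_delay_basic(long bytes, double bytes_per_s)
{
    return bytes / bytes_per_s;
}

/* Refined estimate: the block is segmented into packets of
 * 'payload' bytes, and each packet costs a fixed protocol
 * overhead time on top of the raw transfer time. */
static double comm_delay_packetized(long bytes, double bytes_per_s,
                                    long payload, double per_packet_s)
{
    long packets = (bytes + payload - 1) / payload;  /* ceiling division */
    return bytes / bytes_per_s + packets * per_packet_s;
}
```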

    3.3.2 Initial Simulation and Profiling

    In the first step, an analysis of the initial specification independent of any implementation is performed by profiling the specification during initial simulations. Simulation of the initial specification is necessary in any case to verify functional correctness. Therefore, these simulations can be augmented to obtain valuable information which will be used for implementation-dependent estimations and exploration in general.

    Estimates of the relative computational complexity and the computational requirements of different parts of the specification are obtained by counting basic arithmetic operations (additions, multiplications, etc.), logic operations (and, or, shift, etc.), and memory access operations (moves) during simulation. For each behavior an operation histogram (Figure 11) is created in which the complexity of an execution of the behavior is broken down into the number of occurrences of each basic operation.

    Figure 11: Sample operation profile (occurrence counts for MOVE Reg, ADD, MAC, AND, SHL, MUL, and MOVE Mem operations).

    Operation profiles are summed according to the specification hierarchy such that total profiles for each constrained execution path are obtained. The computational requirements for each path can then be calculated by summing the operation counts count_i and dividing the sum by the path's timing constraint T_path:

        MOPS = (Σ_i count_i) / T_path,

    giving the complexity in million operations per second (MOPS). Note that these initial estimates are architecture independent and therefore do not have to be recalculated during exploration.
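The formula can be evaluated directly from an operation histogram; the histogram values below are made up for illustration and do not come from the vocoder profile:

```python
# Operation histogram for one constrained execution path, summed over
# all behaviors on that path (counts per frame; names are illustrative).
histogram = {"ADD": 250_000, "MAC": 400_000, "SHL": 50_000, "MOVE": 300_000}

def mops(histogram, t_path):
    """Sum of operation counts divided by the path's timing constraint,
    scaled to million operations per second."""
    return sum(histogram.values()) / t_path / 1e6

# 20 ms timing constraint, as for one vocoder frame: roughly 50 MOPS here
rate = mops(histogram, 0.020)
```

Since both the counts and the constraint are implementation-independent, this number stays fixed across all architecture candidates.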

    In addition, dynamic profiling yields information about the dynamic dependencies both among the behaviors and inside the behaviors. For example, for data-dependent loop bounds, information about the worst-case execution is obtained by counting the number of loop iterations. In general, a dynamic analysis of the possible paths through the specification and through the behaviors from inputs to outputs, along with the frequency at which each path is taken, is performed, and the results are stored for future estimations.
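A minimal sketch of how such a dynamic profile could be collected during simulation (the class, counter names, and recorded values are hypothetical):

```python
from collections import defaultdict

class DynamicProfile:
    """Collects loop-bound and path-frequency data during initial simulation."""
    def __init__(self):
        self.loop_iters = defaultdict(int)   # worst-case iterations per loop
        self.path_freq = defaultdict(int)    # executions per path

    def record_loop(self, loop_id, iterations):
        # keep the worst case observed for data-dependent loop bounds
        self.loop_iters[loop_id] = max(self.loop_iters[loop_id], iterations)

    def record_path(self, path_id):
        self.path_freq[path_id] += 1

prof = DynamicProfile()
for lag in (40, 143, 80):                    # made-up data-dependent bounds
    prof.record_loop("pitch_search", lag)    # hypothetical loop identifier
    prof.record_path("coder_main")           # one path execution per frame
```

After three simulated frames, the profile holds the worst-case bound (143) and the path count (3) for later estimation.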

    3.3.3 Estimation

    In contrast to the architecture-independent initial simulation and profiling, the actual estimation during architectural exploration is concerned with deriving the metrics for an implementation of the specification on an underlying architecture. Using the previously collected profiling data, a retargetable estimator is used to obtain metrics for a wide range of HW and SW implementations. Depending on the stage in the exploration flow, the estimator can work at different levels of accuracy in return for estimation speed.


    Coarse Estimation  For the initial exploration, a relative comparison of a large number of architectures has to be done in order to select the best candidates. Therefore, at this point absolute accuracy is not of utmost importance. The estimates should rather be obtained very quickly and provide good relative accuracy, the so-called fidelity. On the other hand, in order to evaluate different architectures, estimation of an implementation on the currently selected target architecture has to be performed.

    With the help of retargetable profilers and estimators, the behaviors are analyzed statically at the basic-block level. For each basic block, the number of cycles required for execution of that block on each of the allocated processors is estimated. With the additional dynamic information about the relations of basic-block executions and execution frequencies that was obtained during simulation and profiling of the initial specification, final estimates of the execution times of the behaviors on the allocated components are computed.
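The combination of static per-block cycle estimates with profiled execution frequencies might be sketched as follows (all block names, cycle counts, and frequencies are invented for illustration):

```python
# Static estimate: cycles per basic block on one allocated processor.
block_cycles = {"bb0": 12, "bb1": 7, "bb2": 30}

# Dynamic profile: how often each block executed per behavior invocation,
# obtained during simulation of the initial specification.
exec_count = {"bb0": 1, "bb1": 40, "bb2": 1}

def behavior_cycles(block_cycles, exec_count):
    """Estimated cycles for one execution of a behavior on this component."""
    return sum(block_cycles[b] * exec_count[b] for b in block_cycles)

cycles = behavior_cycles(block_cycles, exec_count)   # 12 + 7*40 + 30 = 322
time_s = cycles / 60e6                               # at a 60 MHz clock
```

The static analysis is done once per block; the profiled frequencies capture the dynamic behavior without re-simulating on each candidate target.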

    Estimation is very fast this way since both unoptimized synthesis and basic-block profiling can be performed quickly. Extensive, time-consuming simulation of the complete specification is not necessary. Due to the unoptimized nature of the implementations, the absolute accuracy of these results is low. However, under the assumption that optimization gains are independent of the actual target, these estimates exhibit a high fidelity.

    In addition, in each iteration of the exploration loop, estimates have to be derived for the newly allocated components only. For previously allocated components, the information about behavior execution times from quick or accurate estimates of previous iterations is retained.

    Fine Estimation  Later, as the design progresses, the solutions have to be evaluated in relation to the design constraints, thus requiring estimates with high absolute accuracy. However, the closer the design gets to a final implementation, the smaller the number of architectures under consideration. Major decisions have already been made. For each behavior, the component on which it will execute is known after partitioning. Since estimation is fixed to one implementation per behavior, more time can be spent on its analysis. Therefore, estimation run times are traded off for absolute accuracy.

    Basically, accurate estimation creates an actual, optimized hardware or software implementation of the behaviors. Using the backend process, software is compiled and hardware is synthesized with all optimizations enabled, in the same manner as for the final design. Accurate estimates are obtained by combining a static analysis of the final hardware and software execution times with the dynamic information computed during initial simulation and profiling. Since static analysis is done only once for each basic block (the dynamic nature of multiple executions is captured through the initial analysis information), estimation is faster than extensive simulation of the complete implementation.

    Only when a final architecture has been selected and pushed through the backend will an overall simulation of the final design be done, in order to verify both the functionality of the final implementation and the satisfaction of design constraints.

    3.3.4 Vocoder Analysis and Estimation

    For the vocoder example, the goal was basically to come up with the least-cost solution that satisfies the given timing constraints. Therefore, the specification was analyzed to obtain estimates of the execution times and hence eventually the actual delays of the different vocoder parts.

    Initial Analysis  Initially, analysis of the vocoder specification was performed with the goal of obtaining estimates of the relative computational complexity of the behaviors in the specification. Computational complexity is directly related to execution times on different platforms: the higher the complexity, the longer it will take to compute the result on any platform.

    In the case of the vocoder example, the C reference implementation of the vocoder standard (see Appendix A) already provided initial estimates through dynamic profiling by counting basic arithmetic, logic, and memory access operations. The operations are counted on a frame-per-frame basis, weighted according to their estimated relative complexity, and finally combined into the so-called WMOPS estimate (weighted million operations per second) by dividing the sum by the time allowed for each frame (20 ms).
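A possible sketch of the WMOPS calculation; the operation classes, weights, and counts below are illustrative and are not the reference implementation's actual values:

```python
# Hypothetical relative-complexity weights and per-frame counts for the
# basic-operation classes tracked by the reference code's profiling.
weights = {"add": 1, "mult": 1, "mac": 1, "move": 1, "div": 18}
counts = {"add": 150_000, "mult": 60_000, "mac": 80_000,
          "move": 90_000, "div": 1_000}

FRAME_TIME = 0.020   # 20 ms of speech per frame

def wmops(counts, weights, frame_time=FRAME_TIME):
    """Weighted million operations per second for one frame's work."""
    weighted = sum(counts[op] * weights[op] for op in counts)
    return weighted / frame_time / 1e6

estimate = wmops(counts, weights)   # roughly 19.9 WMOPS for these numbers
```

The weighting lets an expensive operation such as a division count for many simple operations, so the estimate tracks real processor effort more closely than a raw operation count.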

    Figure 12 and Figure 15 show the WMOPS estimates for the coder and the decoder, respectively. The estimates are broken down into the major parts LP analysis, closed-loop search, open-loop search and codebook search for the coder, and LSP decoding, subframe decoding and post filtering for the decoder. For each part, both the total per frame and the major contributing subbehaviors are shown. Each subbehavior of the different vocoder parts can be executed several times per frame (e.g. in a loop or once for each subframe). Therefore, for the subbehaviors both the total complexity per frame and the complexity of a single execution, i.e. the total complexity divided by the number of executions, are shown.

    Figure 12: Estimates for computational complexity of coder parts (WMOPS per behavior, total per frame and single execution).

    Figure 13: Breakdown of initial coder delays (cycles per behavior for the first subframe, the remaining frame, and a single execution).

    Figure 14: Initial coder delay (execution sequence of LP_analysis, Open_loop, Closed_loop and Codebook over 0-50 ms).

    Figure 15: Estimates for computational complexity of decoder parts (WMOPS per behavior, total per frame and single execution).

    Figure 16: Breakdown of initial decoder delays (cycles per behavior for the first subframe, the remaining frame, and a single execution).

    Figure 17: Initial decoder delay (execution sequence of D_lsp, Decode_12k2 and Post_filter over 0-6 ms).

    Accurate Estimation  In order to obtain more accurate estimates of the real execution times on the target platform, a quick and straightforward compilation of the behaviors for the target processor under consideration was done next.

    Due to the unavailability of the retargetable estimators and profilers, the target analysis of the vocoder was done by compiling the behavior code for the chosen target processor (Motorola DSP56600 family, see Section 3.4) using the official compiler Motorola provides for their processors. The resulting machine code was then simulated on the Motorola Instruction Set Simulator (ISS) to obtain cycle-accurate execution times.

    Simulations were performed using typical input data. The results given here represent average execution times per frame or per execution. An analysis of the code reveals that the execution times at this level have only minimal dynamic data dependencies, on the order of a few statements' difference (at most 10% difference between worst case and average case). Some subbehaviors exhibit static execution time variations through constant parameters passed into the routines depending on the calling environment, but these variations are averaged out at the higher levels of the hierarchy.

    Figure 13 and Figure 16 show the execution time results (in number of cycles) for the coder and decoder, again broken down into parts and their major subbehaviors. According to the timing budgets derived from the constraints in the initial specification (see Section 2.2.4), the figures show cycles for encoding or decoding of the first subframe plus the rest of the frame, totaling the cycles per complete frame. In addition, the cycles for a single execution of each behavior are included in order to show the complexity of each behavior in comparison with the initial estimates.

    Finally, Figure 14 and Figure 17 show the sequences of execution of the parts for one frame as time progresses in the coder and decoder. The times are based on a processor clock frequency of 60 MHz. As can easily be seen, for this initial implementation both timing constraints are severely violated. For a complete list of simulation results, see Appendix C, Section C.1.

    As mentioned earlier, the results confirm that the coding part is the major contributor to the delays and to the computational complexity in general. Therefore, architectural exploration should focus on this part of the system.

    Inside the coder, the codebook search is overall the most critical part, since it contributes the most to the violation of the two timing constraints. Also, the largest indivisible leaf behavior in terms of execution time is the actual search routine (search_10i40) inside the codebook search.

    Furthermore, comparing the actual execution times with the initial WMOPS estimates suggests that an implementation of the codebook search on the processor introduces a higher overhead of operations not directly related to the computation than the software implementation of the other parts. Analysis of the specification shows that most of the time of the other parts is spent inside very tight loops with loop bodies of only a few statements but a large loop count, as is typical of DSP-style applications. These loops promise very good opportunities for optimization, even in software. An exception is the codebook search, since it is relatively irregular with large and hard-to-optimize code blocks.

    3.4 Architecture Allocation

    3.4.1 General Discussion

    The first step in each iteration of the search through the architectural design space is the choice of an architecture to be evaluated further. Architecture allocation selects the types of components and the number of components of each type in the system. Along with the selection of processors, architecture allocation also defines the general connectivity between the processors by selecting the types and numbers of communication components between processing components.

    Typically, processing elements (PEs) are either programmable processor cores running software or processors implementing a fixed functionality in hardware. Under the aspect of design reuse, the architecture can include predefined components taken out of an intellectual property (IP) library, either from in-house sources or provided by a third-party vendor. Usually the functionality of these IP components is fixed and limited to a small range of possible behaviors. An exception are the software processors: although they are taken out of the library of predefined IP components, they are fully programmable and able to implement basically any behavior.

    Possible system architectures range from pure software solutions using one or more processing elements up to full-custom hardware implementations with one or more ASICs. In between lies a vast range of mixed HW/SW architectures based on a combination of SW and HW processors (Figure 18).

    Figure 18: Examples of mixed HW/SW architectures: (a) one SW PE and one HW PE; (b) two SW PEs and one HW PE.

    3.4.2 Allocation Flow

    Allocation successively walks through the design space, guided by the information obtained during initial analysis and estimation, selecting components out of the library based on their characteristics. The goal is to select architectures which promise to satisfy the timing constraints while staying within the given bounds for other constraints like cost and power.

    Figure 19 shows the basic search tree for processor selection. Starting with no components allocated, in each iteration a new component is added to the architecture, possibly replacing another one. With the help of heuristics and pruning techniques like branch-and-bound, the breadth-first search through the tree is directed towards promising architectures.
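A toy version of such a pruned breadth-first allocation search might look like this (the component library and its cost/MOPS numbers are invented; the real exploration tools use richer heuristics):

```python
def allocate(components, required_mops, max_cost):
    """Breadth-first search over component sets, pruning branches that
    exceed the cost bound or cannot beat the best allocation found so far.
    Returns (allocation, cost) of the cheapest set meeting the MOPS goal."""
    best = None
    frontier = [((), 0, 0.0)]                 # (allocation, cost, MOPS)
    while frontier:
        alloc, cost, rate = frontier.pop(0)   # FIFO order -> breadth-first
        if rate >= required_mops:             # goal met: record, stop growing
            if best is None or cost < best[1]:
                best = (alloc, cost)
            continue
        for name, c, m in components:         # grow allocation by one component
            if cost + c <= max_cost and (best is None or cost + c < best[1]):
                frontier.append((alloc + (name,), cost + c, rate + m))
    return best

# Hypothetical library entries: (name, cost, MOPS on this profile)
lib = [("DSP", 10, 40.0), ("uC", 5, 15.0), ("ASIC", 20, 120.0)]
best = allocate(lib, required_mops=50.0, max_cost=40)   # -> cost-15 DSP + uC set
```

The bound check `cost + c < best[1]` is the branch-and-bound pruning: once any feasible allocation is found, only strictly cheaper branches are expanded.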

    Component Selection  Given the requirement to execute a certain number of basic operations in a certain amount of time, a set of processing components has to be selected such that their computational power promises to run the operations in the given time. The goal is to satisfy the timing constraints while keeping the cost low.

    The principle of component selection is based on choosing a set of processors that achieves the required MOPS rate as defined in Section 3.3. The specific properties and capabilities of the components are matched with the properties of the instruction mix of the specification in order to get more realistic estimates.

    Figure 19: Allocation search tree (root: no allocation; first level: SW; second level: SW+HW, SW+SW; third level: SW+SW+HW; etc.).

    Each component in the library has an associated set of operation weights w_i that reflect the number of instruction cycles typically needed to execute each basic operation in the specification. In addition, the components are tagged with their MIPS rate (million instructions per second). Together with the operation counts c_i in the histogram obtained during initial analysis and estimation, the rate at which each component will approximately execute the specification is calculated as

        MOPS_C = (Σ_i c_i / Σ_i w_i·c_i) · MIPS_C.

    An example of such a graph is shown in Figure 20, where components are sorted by their cost. Although in general the computing power increases with increasing cost, there are local maxima and minima in the operation rates: certain components simply match the operation profile better than others. For example, the vocoder profile includes a large number of multiplications and multiply-accumulates (MACs). A digital signal processor (DSP) with dedicated hardware multipliers and MAC units that operate in one instruction cycle can therefore achieve a higher operation throughput than a general-purpose processor with the same MIPS rate.
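The MOPS_C formula and the DSP-versus-general-purpose effect can be illustrated with a small sketch; the cycle weights, counts, and MIPS numbers below are hypothetical:

```python
def component_mops(counts, weights, mips):
    """MOPS_C = (sum_i c_i / sum_i w_i*c_i) * MIPS_C: the approximate rate
    at which a component executes this operation profile, given its
    per-operation cycle weights w_i and raw MIPS rate."""
    total_ops = sum(counts.values())
    total_cycles = sum(weights[op] * counts[op] for op in counts)
    return total_ops / total_cycles * mips

# Hypothetical profile dominated by MACs, as in the vocoder
counts = {"ADD": 100, "MAC": 300, "MOVE": 100}
dsp_w = {"ADD": 1, "MAC": 1, "MOVE": 1}    # single-cycle MAC unit
gpp_w = {"ADD": 1, "MAC": 4, "MOVE": 1}    # MAC as a multi-cycle sequence

dsp_rate = component_mops(counts, dsp_w, mips=60.0)   # 60.0
gpp_rate = component_mops(counts, gpp_w, mips=60.0)   # lower, despite equal MIPS
```

With identical MIPS ratings, the component whose weights best match the operation mix achieves the higher effective MOPS, which is exactly what produces the local maxima in the cost/rate graph.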

    Given the local maxima of the operation rate graph, sets of components are then selected such that their combined MOPS rates satisfy the MOPS requirement of the specification, at the same time trying to minimize the combined cost of the components.

    Allocation Strategy  During allocation, trade-offs between resource and timing requirements are explored. Depending on the parallelism available in the specification, adding resources (processing components) increases cost but reduces the overall delay. On the other hand, reducing the number of components requires serialization of the specification and increases the delays. Orthogonal to the concept of parallelism,

