SDR waveform components implementation on single FPGA Multiprocessor Platform

M. I. Taj (1,2), O. Hammami (2) and M. Akil (1)
(1) ESIEE, Noisy Le Grand Cedex, France. {tajm@esiee.fr}
(2) ENSTA, Bvd. Victor, Paris, France. {[email protected]}
December 14, 2010




Talk Overview

Motivation and Context

Major Contribution

Software Defined Radio (SDR) Platforms

Related Work

Embedded Multiprocessor platform and Processor

FFT and Viterbi Parallelization Strategy on MPSoC

Performance Evaluation Results

Conclusion


Motivation and Context

Software Defined Radio is entering the mainstream.

The challenges of Software Defined Radio (SDR) lie in implementing a compact embedded multiprocessor platform for wireless mobile terminals.

Performance requirements under resource constraints make SDR implementation a challenging task.

Reconfigurable embedded multiprocessor cores are increasingly being used to implement SDR solutions.

The two most important SDR waveform components:
• FFT
• Viterbi Decoding

This work is a step towards an efficient SDR waveform implementation over the MicroBlaze multiprocessor environment developed at ENSTA.


Major Contribution

This work addresses the mapping of FFT and Viterbi Decoding onto an embedded multiprocessor platform.

A significant speed-up has been achieved: on 16 processors, 11.12 for FFT and 11.48 for Viterbi Decoding relative to a single processor.

This suggests that a whole waveform, including all its DSP algorithms, can be implemented more efficiently on the addressed multiprocessor platform.


SDR Platforms-Introduction

Massively parallel SDR baseband platforms rely on:
• ILP: Instruction Level Parallelism
• DLP: Data Level Parallelism
• MP: Multiprocessor

ILP and DLP call for well-fitted algorithms, because their architectures and the associated compiler requirements make it difficult to map various advanced algorithms.
• A typical example is SODA.

This calls for MP parallelism; however, little effort has gone into deploying SDR on multiprocessor architectures.
• Parallel processing computing element => multiprocessor FPGA platform developed at ENSTA


Related Work

SDR performance has been evaluated on numerous platforms, but very few efforts have been made to parallelize SDR waveforms.

Single Multiprocessor efforts
• 1999 => Reed and Cummings described the transition of FPGAs from ASICs to embedded products.
• 2008 => OSSIE Signal Processing Library functions were mapped onto a single-processor platform, the Xilinx ML-403 board based on a Virtex-4 FX FPGA, to identify the functions to be optimized.
• 2008 => G. Abgrall estimated the latency, CPU load and memory utilization of OSSIE.

Parallelizing efforts
• 2008 => M. Palkovic parallelized the baseband processing of a space division multiplexing (SDM)-orthogonal frequency division multiplexing (OFDM) system on a platform called ADRES.
• 2008 => D. Cabric addressed the feasibility of Cognitive Radios by implementing five Spectrum Sensing Algorithms on the Berkeley Emulation Engine (BEE), which consists of five Xilinx V-2 FPGAs, each embedding a PowerPC 405.
• 2009 => We parallelized the OSSIE SigProc Filter functions over the same platform.


SDR waveform design

SDR Waveform Components:
• Filter Functions
• Algebraic Functions
• Modulation/Demodulation Functions

Advanced functionality => massive intercommunication.

General purpose hardware.
• Strong emergence of MPSoC.
• Actual prototypes (not simulation)

SDR waveform Communication Architecture

FFT
• Multi-resolution spectrum sensing.

Viterbi Decoding
• Efficient bandwidth utilization.


NoC based MPSoC, programming and evaluation platform


Alpha-Data Board Architecture

NoC Based Multiprocessor SoC

Xilinx Virtex4FX140

4 Banks DDR2 (1 Gb)

MicroBlaze v. 6.00.b

Local memory 32 kB


Embedded Processing Element

Processing element contents:
• MicroBlaze softcore v7.00b
• Instruction-side Local Memory Bus (ILMB)
• Data-side Local Memory Bus (DLMB)
• ILMB controller
• DLMB controller
• BRAM for local memory (32 KBytes)
• Fast Simplex Link (FSL)
• OCP-IP adapter
• Optional: timer, local memory, PCI Express, UART

MicroBlaze:
• 32-bit RISC softcore processor
• Harvard architecture
• Highly configurable
• Interfaces: PLB, OPB, LMB, up to 16 FSL

Microblaze Core Block Diagram

Microblaze based Processing Element


Parallelization Primitives and their usage


Syn_start_work( )
• Gives the slaves the command to start calculating their respective coefficients.

Syn_work_finished( )
• Checks whether the slaves have finished calculating their share of coefficients.

Syn_wait_for_start( )
• Each slave waits for the master to give this signal; once the signal arrives, the slave starts its assigned calculations.

barrier( )
• Once a slave has finished its assigned filter coefficient calculation, it calls this function to tell the master that its task is done. After all the slaves have executed this function, the master captures the number of clock cycles.
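The master/slave handshake described by these four primitives can be sketched in Python threads. This is only an illustration under stated assumptions: the class `Sync`, the function `run`, and the sum-of-squares workload are hypothetical, the primitives on the actual platform run on MicroBlaze PEs rather than threads, and wall-clock time stands in for the clock-cycle count captured by the master.

```python
import threading
import time


class Sync:
    """Hypothetical thread-based stand-ins for the four primitives on the slide."""

    def __init__(self, n_slaves):
        self.n_slaves = n_slaves
        self._start = threading.Event()
        self._lock = threading.Lock()
        self._finished = 0

    def syn_start_work(self):
        # Master: tell every slave to begin.
        self._start.set()

    def syn_wait_for_start(self):
        # Slave: block until the master gives the start signal.
        self._start.wait()

    def barrier(self):
        # Slave: report that the assigned share of work is done.
        with self._lock:
            self._finished += 1

    def syn_work_finished(self):
        # Master: check whether all slaves have reported completion.
        with self._lock:
            return self._finished == self.n_slaves


def run(n_slaves, data):
    """Split a sum of squares across slaves and time it from the master side."""
    sync = Sync(n_slaves)
    partial = [0] * n_slaves

    def slave(rank):
        sync.syn_wait_for_start()
        partial[rank] = sum(x * x for x in data[rank::n_slaves])
        sync.barrier()

    threads = [threading.Thread(target=slave, args=(r,)) for r in range(n_slaves)]
    for t in threads:
        t.start()

    t0 = time.perf_counter()
    sync.syn_start_work()
    while not sync.syn_work_finished():
        time.sleep(0.001)
    elapsed = time.perf_counter() - t0  # stand-in for the captured cycle count

    for t in threads:
        t.join()
    return sum(partial), elapsed
```

Calling `run(4, list(range(100)))` returns the combined result together with the time measured between the start signal and the moment the last slave reports completion.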


FFT Parallelization Description - 1

Radix-2 DIT Algorithm.

FFT => Two phases:
1. Step 0 to step log2(n/p) - 1
2. Step log2(n/p) to step log2(n) - 1

• n => number of transformed points of the DFT, divided into p consecutive subsequences.
• No intercommunication in the first phase.
• In the second phase: p/2 PEs compute in parallel.
• Each PE computes 2n/p points.
• One step => one data exchange.
• Each PE computes 2n/p points with its peer's help.
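The two-phase split can be illustrated with a small single-threaded Python sketch. It assumes that n and p are powers of two with p dividing n, and that the input is bit-reversed before the butterflies; the function names are illustrative and not taken from the talk. The sketch does not model the PEs or the NoC. It only shows which stages stay inside a block of n/p points and which stages cross block boundaries, counting one exchange per cross-block stage.

```python
import cmath


def bit_reverse_permute(x):
    """Reorder the input into bit-reversed index order (n must be a power of two, n >= 2)."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        j = int(format(i, f"0{bits}b")[::-1], 2)
        out[j] = x[i]
    return out


def butterfly_stage(a, s, lo, hi):
    """Apply radix-2 DIT stage s (butterfly span 2**s) in place to indices [lo, hi)."""
    half = 1 << s
    size = half << 1
    for start in range(lo, hi, size):
        for k in range(half):
            w = cmath.exp(-2j * cmath.pi * k / size)
            u = a[start + k]
            v = a[start + k + half] * w
            a[start + k] = u + v
            a[start + k + half] = u - v


def two_phase_fft(x, p):
    """Return (FFT of x, number of cross-block stages) for p consecutive blocks."""
    n = len(x)
    a = bit_reverse_permute(x)
    stages = n.bit_length() - 1                # log2(n)
    block = n // p
    local_stages = block.bit_length() - 1      # log2(n/p)

    # Phase 1: steps 0 .. log2(n/p)-1 touch only data inside one block,
    # so each block could be handled by its own PE without communication.
    for pe in range(p):
        for s in range(local_stages):
            butterfly_stage(a, s, pe * block, (pe + 1) * block)

    # Phase 2: steps log2(n/p) .. log2(n)-1 pair points from different blocks,
    # which on a multiprocessor implies one data exchange per step.
    exchanges = 0
    for s in range(local_stages, stages):
        butterfly_stage(a, s, 0, n)
        exchanges += 1
    return a, exchanges


def naive_dft(x):
    """Direct O(n^2) DFT, used only as a correctness reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]
```

For n = 16 and p = 4, the first two stages are block-local and the last two cross block boundaries, so `two_phase_fft` reports two exchanges.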


FFT Parallelization Description - 2

Phase 1 of FFT

FFT Parallelization Description - 3

Phase 2 of FFT

Viterbi Decoding Parallelization Description - 1


Two passes:
• Forward pass => parallel implementation
• Backward pass

Forward pass => likelihood calculation over all possible states.

• Most likely predecessor of state I at time t:
opt_J = argmax over J of { log[prob. system is in state J at t-1] + log[prob. of transition from J to I] }

• Update the probability of I:
log prob. I = log[prob. of opt_J at t-1] + log[transition prob. from opt_J to I] + log[emission prob. that state I gave rise to the observation at t]

• Bookkeeping (record opt_J for the backward pass).
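The forward and backward passes can be sketched in Python as a generic log-domain Viterbi decoder over a small hidden Markov model. This is a sequential illustration, not the authors' implementation: the function name `viterbi` and its arguments are hypothetical, and the comment marking the per-state loop as the parallelizable part is an assumption about how the forward pass could be split across PEs.

```python
import math


def viterbi(log_init, log_trans, log_emit, obs):
    """Return (most likely state path, its log-probability).

    log_init[i]     : log prob. of starting in state i
    log_trans[j][i] : log prob. of a transition from state j to state i
    log_emit[i][o]  : log prob. that state i emits observation o
    """
    n_states = len(log_init)
    score = [log_init[i] + log_emit[i][obs[0]] for i in range(n_states)]
    back = []

    # Forward pass. At a given time step every target state i depends only on
    # the previous scores, so the loop over i is what could be divided among
    # PEs (an assumption; the slides do not spell out the split).
    for o in obs[1:]:
        new_score, pointers = [], []
        for i in range(n_states):
            opt_j = max(range(n_states), key=lambda j: score[j] + log_trans[j][i])
            new_score.append(score[opt_j] + log_trans[opt_j][i] + log_emit[i][o])
            pointers.append(opt_j)  # bookkeeping for the backward pass
        score = new_score
        back.append(pointers)

    # Backward pass: follow the recorded predecessors from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    best = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best
```

The decoder takes log-probabilities, so a model given as plain probabilities needs `math.log` applied to each entry first.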


Viterbi Decoding Parallelization Description - 2

Forward pass of Viterbi decoding

Performance Evaluation Results on MPSoC - 1

Fast Fourier Transform

Test Mode           Number of clock cycles    Speed up
Single processor    2,087,142                 -
4 processors        668,956                   3.12
8 processors        348,438                   5.99
16 processors       187,693                   11.12

Speed-up for FFT using the multiprocessor platform.

Viterbi Decoding

Test Mode           Number of clock cycles    Speed up
Single processor    1,667,911                 -
4 processors        483,452                   3.45
8 processors        277,522                   6.01
16 processors       145,287                   11.48

Speed-up for Viterbi Decoding using the multiprocessor platform.
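The speed-up column follows from the cycle counts as the single-processor count divided by the multiprocessor count. The short Python sketch below recomputes it from the table values; the parallel efficiency (speed-up divided by the number of processors) is an extra derived figure that does not appear in the slides, and the names `cycles`, `speed_up` and `efficiency` are illustrative.

```python
# Clock-cycle counts copied from the two result tables.
cycles = {
    "FFT": {1: 2_087_142, 4: 668_956, 8: 348_438, 16: 187_693},
    "Viterbi Decoding": {1: 1_667_911, 4: 483_452, 8: 277_522, 16: 145_287},
}


def speed_up(table):
    """Speed-up on p processors = single-processor cycles / p-processor cycles."""
    base = table[1]
    return {p: base / c for p, c in table.items() if p > 1}


def efficiency(table):
    """Parallel efficiency = speed-up / p (derived here, not reported in the talk)."""
    return {p: s / p for p, s in speed_up(table).items()}


if __name__ == "__main__":
    for name, table in cycles.items():
        eff = efficiency(table)
        for p, s in speed_up(table).items():
            print(f"{name}: {p} processors, speed-up {s:.2f}, efficiency {eff[p]:.0%}")
```

Rounded to two decimals, the recomputed speed-ups match the tabulated values.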


Performance Evaluation Results on MPSoC - 2

[Plot: Speed up versus Number of processors, with one curve each for Viterbi Decoding and FFT]

Speed up is evaluated by the Master PE.

Different slave processors access the DDR via NoC after they finish their share of calculation.

Master and Slaves invoke appropriate synchronization primitives.

Speed-up for FFT and Viterbi Decoding using MPSoC


Conclusion

Multicore and embedded multiprocessor architectures are strongly emerging for SDR applications.

FFT and Viterbi Decoding are the most important components of an SDR waveform.

This work enhanced our SDR waveform functionality through an optimized parallel implementation of these two components.

The speed-up achieved answers the ITRS Roadmap prediction for SDR.


Thank You

Questions ???


References-1

1. www.wirelessinnovation.org

2. http://www.itrs.net

3. M. I. Taj, O. Hammami and K. Huggins, "Performance Evaluation of SDR on embedded platform: The case of OSSIE", The 2nd IEEE IC-4, Best Paper Award.

4. M. I. Taj, K. Huggins and O. Hammami, "OSSIE Signal Processing Functions Performance Enhancements through Parallelization in an embedded multiprocessor architecture", 2009 Software Defined Radio Technical Conference and Product Exposition, Madrid, Spain, April 2009.

5. R. J. Lackey and D. W. Upmal, "SPEAKeasy: The Military Software Radio", IEEE Commun. Mag., vol. 33, no. 5, May 1995, pp. 56-61.

6. M. Woh, Y. Lin, S. Seo, S. Mahlke, T. Mudge and C. Chakrabarti, "From SODA to Scotch: The Evolution of a Wireless Baseband Processor", 41st IEEE/ACM International Symposium on Microarchitecture, 8-12 Nov. 2008, Como, Italy, pp. 152-163.

7. J. Glossner, D. Iancu, M. Moudgill, G. Nacer, S. Jinturkar, S. Stanley and M. Schulte, "The Sandbridge SB3011 Platform", EURASIP Journal on Embedded Systems, Volume 2007, Issue 1 (January 2007).

8. B. Bougard, B. D. Sutter, S. Rabou, D. Novo, O. Allam, S. Dupont, L. Van der Perre, "A Coarse-Grained Array based Baseband Processor for 100 Mbps+ Software Defined Radio", Design, Automation and Test in Europe, March 10-14, 2008, Munich, Germany.

9. M. Palkovic, H. Cappelle, M. Bougard, L. Van der Perre, "Mapping of 40 MHz MIMO SDM-OFDM Baseband Processing on Multi-Processor SDR Platform", The 11th Workshop on Design and Diagnostics of Electronic Circuits and Systems, 16-18 April 2008.

10. B. Mei, S. Vernalde, D. Verkest, H. De Man, R. Lauwereins, "ADRES: an architecture with tightly coupled VLIW processor and coarse-grained configurable matrix", Proc. IEEE Conf. on Field Programmable Logic and its Applications (FPL), Lisbon, Portugal, pp. 61-70, Sep. 2003.


References-2

11. A. N. Choudhary, S. Das, N. Ahuja and H. Patel, "A Reconfigurable and Hierarchical Parallel Processing Architecture: Performance Results for Stereo Vision", 10th International Conference on Pattern Recognition, 16-21 Jun. 1990, Proceedings vol. 2, pp. 389-393.

12. J. H. Bahn, J. Yang and N. Bagherzadeh, "Parallel FFT Algorithms on Network-on-Chips", Fifth International Conference on Information Technology: New Generations, 2008, pp. 1087-1093. ISBN: 978-0-7965-3099-4

13. A. H. Kamalizad, C. Pan and N. Bagherzadeh, "Fast Parallel FFT on a Reconfigurable Computation Platform", 15th Symposium on Computer Architecture and High Performance Computing, 10-12 Nov. 2003, Proceedings, pp. 254-259.

14. G. Zhong, F. Xu and A. N. Wilson, "A Power-Scalable Reconfigurable FFT/IFFT IC Based on a Multi-Processor Ring", IEEE Journal of Solid State Circuits, vol. 41, no. 2, February 2006, pp. 483-495. ISSN: 0018-9200

15. Xilinx Virtex-4, http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/index.htm

16. OCP-IP Open Core Protocol Specification 2.2 (pdf), http://www.ocpip.org/home, 2008.

17. MicroBlaze Processor Reference Guide, Xilinx user guide 081 (v.7.0).

18. Arteris S.A., http://www.arteris.com/

19. J. S. Reeve, K. Amarasingh, "A parallel Viterbi decoder for Block Cyclic and Convolution codes", IEEE Signal Processing, 2005, pp. 273-278.

20. M. Monchiero, G. Palermo, C. Silvano, O. Villa, "Exploration of distributed shared memory architectures for NoC-based multiprocessors", Journal of Systems Architecture, vol. 53, issue 10, October 2007, pp. 719-732.