qualcomm hexagon dsp: an architecture optimized for mobile

23
1 Qualcomm Technologies, Inc. All Rights Reserved Qualcomm Technologies, Inc. All Rights Reserved Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications Lucian Codrescu Sr. Director, Technology Qualcomm Technologies, Inc.

Upload: ledieu

Post on 05-Feb-2017

239 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Qualcomm Hexagon DSP: An architecture optimized for mobile

1

Qualcomm Technologies, Inc. All Rights Reserved Qualcomm Technologies, Inc. All Rights Reserved

Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications

Lucian Codrescu Sr. Director, Technology Qualcomm Technologies, Inc.

Page 2: Qualcomm Hexagon DSP: An architecture optimized for mobile

2

Qualcomm Technologies, Inc. All Rights Reserved

Camera

Display

JPEG

Video

Other

• aDSP: Real-time media & sensor processing

Hexagon™ DSP processors in Snapdragon products

Multimedia Fabric System Fabric

Krait CPU Adreno

GPU

Krait CPU

Krait CPU

Krait CPU

2MB L2

Misc. Connectivity

Modem

Snapdragon 800

Fabric & Memory Controller

LPDDR3 LPDDR3

Hexagon aDSP

Hexagon mDSP

• mDSP: Dedicated modem processing

Audio

Sensors

Page 3: Qualcomm Hexagon DSP: An architecture optimized for mobile

3

Qualcomm Technologies, Inc. All Rights Reserved

Expansion of Hexagon DSP use cases beyond audio

Image Enhancement Camera, Still, Video HexagonV4 based products

Video HexagonV5 based products

Sensors HexagonV5 based products

Computer Vision & Augmented Reality HexagonV4 based products

HexagonV2/V3

Voice Audio

Hexagon DSP is evolving for use beyond voice and audio to computer vision, video and imaging features

Page 4: Qualcomm Hexagon DSP: An architecture optimized for mobile

4

Qualcomm Technologies, Inc. All Rights Reserved

The Hexagon DSP evolution Generational improvements in performance and power efficiency driven by both architecture and implementation

V4L 28nm

Apr 2011

V3M 45nm

June 2009

V2 65nm

Dec 2007

V3C 45nm Aug

2009

V3L 45nm Nov

2009

V4M 28nm

Dec 2010

V4C 28nm

Dec 2010

V5A 28nm

Dec 2012

V1 65nm

Oct 2006

Time

V5H 28nm

Dec 2012

Page 5: Qualcomm Hexagon DSP: An architecture optimized for mobile

5

Qualcomm Technologies, Inc. All Rights Reserved

Requirements • Require fixed real-time

performance level (fps, Mbit/sec, etc.)

• Extremely aggressive power & area targets

Key characteristics of modem & multimedia applications

Characteristics • Mix of signal processing

& control code − For modem, Qualcomm does not

use a split CPU/DSP architecture. All processing is done on Hexagon DSP

− Multimedia apps have significant control in the RTOS & frameworks

• Heavy L2$ misses − Multimedia is data intensive − Modem is code intensive

Page 6: Qualcomm Hexagon DSP: An architecture optimized for mobile

6

Qualcomm Technologies, Inc. All Rights Reserved

Hexagon DSP blends features targeted to modem & multimedia

Hexagon DSP

VLIW • Need multi-issue to

meet performance • Low complexity for

Area & Power

Innovate in ISA to maximize IPC

• More work/VLIW packet reduces energy/instruction

• Keep the pipelines full for MIPS/mm2

• Target both Signal Processing & Control code

Multi-Threading • To reduce L2$ miss

penalty without the need for a large L2

• Increases instructions/VLIW packet because compiler doesn’t need to schedule latency

Page 7: Qualcomm Hexagon DSP: An architecture optimized for mobile

7

Qualcomm Technologies, Inc. All Rights Reserved

Instruction Unit

VLIW: Area & power efficient multi-issue

Data Unit (Load/ Store/ ALU)

Data Unit (Load/ Store/ ALU)

Execution Unit

(64-bit Vector)

Execution Unit

(64-bit Vector)

Data Cache

L2 Cache / TCM

Instruction Cache

• Dual 64-bit load/store units

• Also 32-bit ALU

Variable sized instruction packets (1 to 4 instructions per Packet)

• Dual 64-bit execution units • Standard 8/16/32/64bit data

types • SIMD vectorized MPY / ALU

/ SHIFT, Permute, BitOps • Up to 8 16b MAC/cycle • 2 SP FMA/cycle

Register File Register File

Register File/Thread

• Unified 32x32bit General Register File is best for compiler.

• No separate Address or Accum Regs

• Per-Thread

Device DDR

Memory

Page 8: Qualcomm Hexagon DSP: An architecture optimized for mobile

8

Qualcomm Technologies, Inc. All Rights Reserved

Maximizing the signal processing code work/packet Example from inner loop of FFT: Executing 29 “simple RISC ops” in 1 cycle

Rs

Add

I R

Rt

*32

<<0-1

*32

<<0-1

Rd

I R

Add

I R

*32

<<0-1

*32

<<0-1

I R

Rs

Rt

-0x80000x8000

Sat_32 Sat_32

High 16bitsHigh 16bits

I R

+ + + +

{ R17:16 = MEMD(R0++M1) MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):<<1:rnd:sat R11:10 = VADDH(R11:10, R13:12) }:endloop0

Complex multiply with round and saturation

Vector 4x16-bit Add

64-bit Load and

Zero-overhead loops • Dec count

• Compare

• Jump top

64-bit Store with post-update addressing

Page 9: Qualcomm Hexagon DSP: An architecture optimized for mobile

9

Qualcomm Technologies, Inc. All Rights Reserved

Maximizing the control code work/packet Hexagon DSP ISA improves control code efficiency over traditional VLIW

Example C code void example(int *ptr, int val) { if (ptr!=0) { *ptr = *ptr + val + 2; }}

p0 = cmp.eq(r0,#0) { if (!p0) r2=memw(r0) if (p0) jumpr:nt r31 } r2 = add(r2,#2) r1 = add(r1,r2) { memw(r0) = r1 jumpr r31 }

Instr/Packet = 7 instr/5 packets = 1.4

Tradional VLIW Assembly Code

③ ④

{ p0 = cmp.eq (r0,#0) if (!p0.new) r2=memw(r0) if (p0.new) jumpr:nt r31 } r2 = add(r2,#2) r1 = add(r1,r2) { memw(r0) = r1 jumpr r31 }

Hexagon DSP: Dot-New Predication

② ③

{ p0 = cmp.eq(r0,#0) if (!p0.new) r2=memw(r0) if (p0.new) jumpr:nt r31 } r1 = add(r1,add(r2,#2)) { memw(r0) = r1 jumpr r31 }

Hexagon DSP: Compound ALU

{ p0 = cmp.eq(r0,#0) if (!p0.new) r2=memw(r0) if (p0.new) jumpr:nt r31 } { r1 = add(r1,add(r2,#2)) memw(r0) = r1.new jumpr r31 }

Hexagon DSP: New-Value Store

Instr/Packet = 7 instr/2packets = 3.5

Page 10: Qualcomm Hexagon DSP: An architecture optimized for mobile

10

Qualcomm Technologies, Inc. All Rights Reserved

High avg. instructions/packet for targeted use cases Compound instructions count as 2

Audio Computer Vision Video Imaging Control

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Av

era

ge

In

str

uc

tio

ns /

VL

IW P

ac

ke

t

Source: Qualcomm internal measurements

Page 11: Qualcomm Hexagon DSP: An architecture optimized for mobile

11

Qualcomm Technologies, Inc. All Rights Reserved

• Hexagon V5 includes three hardware threads • Architected to look like a multi-core with communication

through shared memory

Programmer’s view of Hexagon DSP HW multi-threading

Thread 0

D

U

X

U

Shared Data Cache

L2 Cache /

TCM

Register File

D

U

X

U

Thread 1

D

U

X

U

Register File

D

U

X

U

Thread 2

D

U

X

U

Register File

D

U

X

U

Shared Instruction Cache

Page 12: Qualcomm Hexagon DSP: An architecture optimized for mobile

12

Qualcomm Technologies, Inc. All Rights Reserved

• Number of threads match execution pipe depth (three threads three execute stages)

• All instructions complete before next packet dispatch • Compiler schedules for zero-latency which helps to increase

instructions/VLIW packet

Hexagon DSP V1-V4: Interleaved multi-threading Simple round-robin thread scheduling

Thread 0 Dispatch

T0: { Ld Ld Add Cmp }

Thread 1 Dispatch

T1: { St Ld Mpy Add }

T0: { Ld Ld Add Cmp }

Thread 2 Dispatch

T2: { Ld Add Jump }

T1: { St Ld Mpy Add }

T0: { Ld Ld Add Cmp }

Page 13: Qualcomm Hexagon DSP: An architecture optimized for mobile

13

Qualcomm Technologies, Inc. All Rights Reserved

• Remove a thread from IMT rotation − On L2 cache misses − When in wait-for-interrupt or off

mode • Additional forwarding to support

2-cycle packets

• VLIW packets with dependencies between long latency instructions will stall

− But many VLIW packets with simple instructions can complete in 2 processor clocks

Hexagon DSP V5: Dynamic HW multi-threading Recover some performance when threads idle or stalled

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Dhrystone DMIPS/MHz

IMT DMT

0

1

2

3

4

5

6

7

8

Coremarks/MHz

IMT DMT

Source: Qualcomm internal measurements

Page 14: Qualcomm Hexagon DSP: An architecture optimized for mobile

14

Qualcomm Technologies, Inc. All Rights Reserved

00.5

11.5

22.5

33.5

44.5

5

Aver

age

Inst

ruct

ions

/ C

ycle

IPC_DMTIPC_IMT

Hexagon DSP instructions per cycle

Single-Threaded Apps

Multi-Threaded Apps

Source: Qualcomm internal measurements

Page 15: Qualcomm Hexagon DSP: An architecture optimized for mobile

15

Qualcomm Technologies, Inc. All Rights Reserved

Qualcomm Hexagon DSP architecture Highly efficient mobile application processor—designed for more performance per MHz

Clock Rate (MHz) 430-520 100-233 300-700 DSP Performance (BDTImark2000) 4730-5720 1810-4220 5440-12660*

* - Projected best case score for 3-threads

Source: BDTI - For more detailed information see www.BDTI.com. All scores ©2013 BDTI

02468

101214161820

Mobile Competitor Qualcomm HXGN V4 (1 thread) Qualcomm HXGN V4 (3 threads)

DSP

Per

form

ance

per

MH

z B

DTI

mar

k200

0™/M

Hz

Page 16: Qualcomm Hexagon DSP: An architecture optimized for mobile

16

Qualcomm Technologies, Inc. All Rights Reserved

Hexagon DSP Power Benefits

Page 17: Qualcomm Hexagon DSP: An architecture optimized for mobile

17

Qualcomm Technologies, Inc. All Rights Reserved

MP3 playback power for competitive smartphones

Competitor A Qualcomm /Hexagon-based

Competitor B Competitor C Competitor D Competitor E Competitor F Competitor G

• Power measured at the battery for various phones • Includes everything: DSP, CPU, memory, analog components, etc

Power Lo

wer

is b

ette

r

Source: Qualcomm internal measurements

Page 18: Qualcomm Hexagon DSP: An architecture optimized for mobile

18

Qualcomm Technologies, Inc. All Rights Reserved

Augmented Reality Java Application

Call Feature Detect

ARM Only FastCV Library

Feature Detect

Function

App CPU

FastCV Call Router

Hexagon (QDSP6) FastCV Library

Feature Detect

Function

App DSPVeNum

ARM/VeNum FastCV Library

Feature Detect

Function

32% Less Power* 7% Less Time 52% Less CPU

Augmented Reality Java App finding objects in image using FastCV Feature Detect

Comparison of Feature Detect run on: • App CPU (ARM/Neon) • App DSP (Hexagon)

Computer vision offload – ARM/neon to Hexagon DSP

Detection Time (%)

Total Device Power (%)

CPU Utilization (%)

Source: Qualcomm internal measurements. * Power measured at the device battery

Page 19: Qualcomm Hexagon DSP: An architecture optimized for mobile

19

Qualcomm Technologies, Inc. All Rights Reserved

• Excellent near-linear power scalability (as threads go idle, power used by the thread is nearly eliminated)

• Achieved through optimized clock tree design & clock gating

Hexagon DSP power for different thread utilizations

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Dhrystone Power, IMT Mode

Actual

Ideal

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

FIR Power, IMT Mode

Actual

Ideal

Source: Qualcomm internal measurements

Page 20: Qualcomm Hexagon DSP: An architecture optimized for mobile

20

Qualcomm Technologies, Inc. All Rights Reserved

Hexagon DSP Software Development

Page 22: Qualcomm Hexagon DSP: An architecture optimized for mobile

22

Qualcomm Technologies, Inc. All Rights Reserved

Announcing the Hexagon DSP SDK See the Hexagon DSP SDK in action at Uplinq2013 (www.uplinq.com)

Visit http://developer.qualcomm.com for more information.

Page 23: Qualcomm Hexagon DSP: An architecture optimized for mobile

23

Qualcomm Technologies, Inc. All Rights Reserved

For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog

©2013 Qualcomm Technologies, Inc. Qualcomm and Hexagon are trademarks of QUALCOMM Incorporated, registered in the United States and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other product and brand names may be trademarks or registered trademarks of their respective owners. Hexagon is a product of Qualcomm Technologies, Inc.

Thank you

Follow us on: