qualcomm hexagon dsp: an architecture optimized for mobile ... · memd(r6++m1) = r25:24 r20 =...

23
1 Qualcomm Technologies, Inc. All Rights Reserved Qualcomm Technologies, Inc. All Rights Reserved Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications Lucian Codrescu Sr. Director, Technology Qualcomm Technologies, Inc.

Upload: others

Post on 13-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

1Qualcomm Technologies, Inc. All Rights ReservedQualcomm Technologies, Inc. All Rights Reserved

Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications

Lucian CodrescuSr. Director, TechnologyQualcomm Technologies, Inc.

Page 2: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

2Qualcomm Technologies, Inc. All Rights Reserved

Camera

Display

JPEG

Video

Other

• aDSP: Real-time media & sensor processing

Hexagon™ DSP processors in Snapdragon products

Multimedia Fabric System Fabric

KraitCPUAdreno

GPU

KraitCPU

KraitCPU

KraitCPU

2MB L2

Misc.Connectivity

Modem

Snapdragon 800

Fabric & Memory Controller

LPDDR3 LPDDR3

HexagonaDSP

HexagonmDSP

• mDSP: Dedicated modem processing

Audio

Sensors

Page 3: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

3Qualcomm Technologies, Inc. All Rights Reserved

Expansion of Hexagon DSP use cases beyond audio

Image EnhancementCamera, Still, VideoHexagonV4 based products

VideoHexagonV5 based products

SensorsHexagonV5 based products

Computer Vision & Augmented RealityHexagonV4 based products

HexagonV2/V3

Voice Audio

Hexagon DSP is evolving for use beyond voice and audio to computer vision, video and imaging features

Page 4: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

4Qualcomm Technologies, Inc. All Rights Reserved

The Hexagon DSP evolutionGenerational improvements in performance and power efficiency driven by both architecture and implementation

V4L28nm

Apr 2011

V3M45nm

June 2009

V265nm

Dec 2007

V3C45nm Aug

2009

V3L45nm Nov

2009

V4M28nm

Dec 2010

V4C28nm

Dec 2010

V5A28nm

Dec 2012

V165nm

Oct 2006

Time

V5H28nm

Dec 2012

Page 5: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

5Qualcomm Technologies, Inc. All Rights Reserved

Requirements• Require fixed real-time

performance level (fps, Mbit/sec, etc.)

• Extremely aggressive power & area targets

Key characteristics of modem & multimedia applications

Characteristics• Mix of signal processing

& control code− For modem, Qualcomm does not

use a split CPU/DSP architecture.All processing is done on Hexagon DSP

− Multimedia apps have significant control in the RTOS & frameworks

• Heavy L2$ misses− Multimedia is data intensive− Modem is code intensive

Page 6: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

6Qualcomm Technologies, Inc. All Rights Reserved

Hexagon DSP blends features targeted to modem &multimedia

HexagonDSP

VLIW• Need multi-issue to

meet performance• Low complexity for

Area & Power

Innovate in ISA to maximize IPC

• More work/VLIW packet reduces energy/instruction

• Keep the pipelines full for MIPS/mm2

• Target both Signal Processing & Control code

Multi-Threading • To reduce L2$ miss

penalty without the need for a large L2

• Increases instructions/VLIW packet because compiler doesn’t need to schedule latency

Page 7: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

7Qualcomm Technologies, Inc. All Rights Reserved

Instruction Unit

VLIW: Area & power efficient multi-issue

Data Unit (Load/ Store/ ALU)

Data Unit (Load/ Store/ ALU)

Execution Unit

(64-bit Vector)

Execution Unit

(64-bit Vector)

Data Cache

L2 Cache / TCM

Instruction Cache

• Dual 64-bit load/store units

• Also 32-bit ALU

Variable sized instruction packets (1 to 4 instructions per Packet)

• Dual 64-bit execution units• Standard 8/16/32/64bit data

types• SIMD vectorized MPY / ALU

/ SHIFT, Permute, BitOps• Up to 8 16b MAC/cycle• 2 SP FMA/cycle

Register FileRegister File

Register File/Thread

• Unified 32x32bit General Register File is best for compiler.

• No separate Address or Accum Regs

• Per-Thread

DeviceDDR

Memory

Page 8: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

8Qualcomm Technologies, Inc. All Rights Reserved

Maximizing the signal processing code work/packetExample from inner loop of FFT: Executing 29 “simple RISC ops” in 1 cycle

{ R17:16 = MEMD(R0++M1)MEMD(R6++M1) = R25:24R20 = CMPY(R20, R8):<<1:rnd:satR11:10 = VADDH(R11:10, R13:12)}:endloop0

Complex multiply with round and saturation

Vector 4x16-bit Add

64-bit Load and

Zero-overhead loops•Dec count•Compare•Jump top

64-bit Store with post-update addressing

Page 9: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

9Qualcomm Technologies, Inc. All Rights Reserved

Maximizing the control code work/packetHexagon DSP ISA improves control code efficiency over traditional VLIW

Example C code void example(int *ptr, int val) {

if (ptr!=0) {*ptr = *ptr + val + 2;

}}

p0 = cmp.eq(r0,#0) {

if (!p0) r2=memw(r0)if (p0) jumpr:nt r31

}r2 = add(r2,#2)r1 = add(r1,r2)

{memw(r0) = r1jumpr r31

}

Instr/Packet = 7 instr/5 packets = 1.4

Tradional VLIW Assembly Code

③④

{p0 = cmp.eq (r0,#0) if (!p0.new) r2=memw(r0)if (p0.new) jumpr:nt r31

}r2 = add(r2,#2)r1 = add(r1,r2)

{memw(r0) = r1jumpr r31

}

Hexagon DSP:Dot-New Predication

②③

{p0 = cmp.eq(r0,#0) if (!p0.new) r2=memw(r0)if (p0.new) jumpr:nt r31

}r1 = add(r1,add(r2,#2))

{memw(r0) = r1jumpr r31

}

Hexagon DSP:Compound ALU

{p0 = cmp.eq(r0,#0) if (!p0.new) r2=memw(r0)if (p0.new) jumpr:nt r31

}{

r1 = add(r1,add(r2,#2))memw(r0) = r1.newjumpr r31

}

Hexagon DSP:New-Value Store

Instr/Packet = 7 instr/2packets = 3.5

Page 10: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

10Qualcomm Technologies, Inc. All Rights Reserved

High avg. instructions/packet for targeted use casesCompound instructions count as 2

AudioComputer Vision Video Imaging Control

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Average Instructions / VLIW Packet

Source: Qualcomm internal measurements

Page 11: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

11Qualcomm Technologies, Inc. All Rights Reserved

• Hexagon V5 includes three hardware threads• Architected to look like a multi-core with communication

through shared memory

Programmer’s view of Hexagon DSP HW multi-threading

Thread 0

DU

XU

Shared Data Cache

L2 Cache /

TCM

Register File

DU

XU

Thread 1

DU

XU

Register File

DU

XU

Thread 2

DU

XU

Register File

DU

XU

Shared Instruction Cache

Page 12: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

12Qualcomm Technologies, Inc. All Rights Reserved

• Number of threads match execution pipe depth(three threads three execute stages)

• All instructions complete before next packet dispatch• Compiler schedules for zero-latency which helps to increase

instructions/VLIW packet

Hexagon DSP V1-V4: Interleaved multi-threadingSimple round-robin thread scheduling

Thread 0 Dispatch

T0: { Ld Ld Add Cmp }

Thread 1 Dispatch

T1: { St Ld Mpy Add }

T0: { Ld Ld Add Cmp }

Thread 2 Dispatch

T2: { Ld Add Jump }

T1: { St Ld Mpy Add }

T0: { Ld Ld Add Cmp }

Page 13: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

13Qualcomm Technologies, Inc. All Rights Reserved

• Remove a thread from IMT rotation− On L2 cache misses− When in wait-for-interrupt or off

mode• Additional forwarding to support

2-cycle packets

• VLIW packets with dependencies between long latency instructions will stall− But many VLIW packets with

simple instructions can complete in 2 processor clocks

Hexagon DSP V5: Dynamic HW multi-threadingRecover some performance when threads idle or stalled

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Dhrystone DMIPS/MHz

IMT DMT

0

1

2

3

4

5

6

7

8

Coremarks/MHz

IMT DMT

Source: Qualcomm internal measurements

Page 14: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

14Qualcomm Technologies, Inc. All Rights Reserved

00.5

11.5

22.5

33.5

44.5

5

Aver

age

Inst

ruct

ions

/ C

ycle

IPC_DMTIPC_IMT

Hexagon DSP instructions per cycle

Single-Threaded Apps

Multi-Threaded Apps

Source: Qualcomm internal measurements

Page 15: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

15Qualcomm Technologies, Inc. All Rights Reserved

Qualcomm Hexagon DSP architectureHighly efficient mobile application processor—designed for more performance per MHz

Clock Rate (MHz) 430-520 100-233 300-700

DSP Performance (BDTImark2000) 4730-5720 1810-4220 5440-12660*

* - Projected best case score for 3-threads

Source: BDTI - For more detailed information see www.BDTI.com. All scores ©2013 BDTI

02468

101214161820

Mobile Competitor Qualcomm HXGN V4 (1 thread) Qualcomm HXGN V4 (3 threads)

DSP

Per

form

ance

per

MH

zB

DTI

mar

k200

0™/M

Hz

Page 16: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

16Qualcomm Technologies, Inc. All Rights Reserved

Hexagon DSP Power Benefits

Page 17: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

17Qualcomm Technologies, Inc. All Rights Reserved

MP3 playback power for competitive smartphones

Competitor A Qualcomm /Hexagon-based

Competitor B Competitor C Competitor D Competitor E Competitor F Competitor G

• Power measured at the battery for various phones• Includes everything: DSP, CPU, memory, analog components, etc

PowerLo

wer

is b

ette

r

Source: Qualcomm internal measurements

Page 18: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

18Qualcomm Technologies, Inc. All Rights Reserved

32% Less Power*7% Less Time52% Less CPU

Augmented Reality Java App finding objects in image using FastCV Feature Detect

Comparison of Feature Detect run on:• App CPU (ARM/Neon)• App DSP (Hexagon)

Computer vision offload – ARM/neon to Hexagon DSP

Detection Time (%) Total Device Power (%)CPU Utilization (%)

Source: Qualcomm internal measurements. * Power measured at the device battery

Page 19: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

19Qualcomm Technologies, Inc. All Rights Reserved

• Excellent near-linear power scalability(as threads go idle, power used by the thread is nearly eliminated)

• Achieved through optimized clock tree design & clock gating

Hexagon DSP power for different thread utilizations

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Dhrystone Power, IMT Mode

Actual

Ideal

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

FIR Power,IMT Mode

Actual

Ideal

Source: Qualcomm internal measurements

Page 20: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

20Qualcomm Technologies, Inc. All Rights Reserved

Hexagon DSP Software Development

Page 21: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

21Qualcomm Technologies, Inc. All Rights Reserved

Independent Algorithm Developers on Hexagon DSP

Page 22: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

22Qualcomm Technologies, Inc. All Rights Reserved

Announcing the Hexagon DSP SDKSee the Hexagon DSP SDK in action at Uplinq2013 (www.uplinq.com)

Visit http://developer.qualcomm.com for more information.

Page 23: Qualcomm Hexagon DSP: An architecture optimized for mobile ... · MEMD(R6++M1) = R25:24 R20 = CMPY(R20, R8):

23Qualcomm Technologies, Inc. All Rights Reserved

For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog

©2013 Qualcomm Technologies, Inc.Qualcomm and Hexagon are trademarks of QUALCOMM Incorporated, registered in the United States and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other product and brand names may be trademarks or registered trademarks of their respective owners.Hexagon is a product of Qualcomm Technologies, Inc.

Thank youFollow us on: