
Post on 13-Sep-2020


Energy-Efficient Data Stream Processing on Ultra-Low-Power Embedded Multicore Devices

Ivan Walulya, Chalmers University of Technology

This project is part of the portfolio of the A.3 – Advanced Computing and Complex System Unit, Communications Networks, Content and Technology DG, European Commission

www.excess-project.eu
Copyright © 2013 - 2016 The EXCESS Consortium

Contract Number: 611183
Total Cost [€]: 3.31 million
Starting Date: 2013-09-01
Duration: 36 months

New World Order

I. Walulya @ CTH 2

• Traditional DBMS: data stored in finite, persistent data sets
• Data is continuously growing faster than our ability to store or index it
• Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
• Data-stream sources:
  • Network monitoring and traffic engineering
  • Sensor networks
  • Telecom call-detail records
  • Financial applications
  • Manufacturing processes
  • Web logs and clickstreams
  • Others ...

Real-time Stream Processing

Motivation: Network Monitoring Queries

[Figure: network monitoring setup. Labels: DSL/Cable Networks, Enterprise Networks, Peer, PSTN; streams R1, R2, R3; Network Operations Center (NOC); DBMS (Oracle, DB2); Back-end Data Warehouse; Off-line analysis: slow, expensive]

Example queries:

• What are the top (most frequent) 1000 (source, dest) pairs seen over the last 1 hour?

• SQL join query:
    SELECT COUNT (R1.source, R2.dest)
    FROM R1, R2
    WHERE R1.dest = R2.source

• Set-expression query: How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?

• Store-then-process is not feasible!
• Extra complexity comes from limited space and time
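The first query on this slide (most frequent (source, dest) pairs over the last hour) can be answered in one pass if per-pair counts are updated as tuples arrive and tuples older than the window are expired. A minimal Python sketch, assuming tuples arrive in non-decreasing timestamp order as (timestamp, source, dest); the class name `WindowedPairCounter` and the sample data are illustrative and not taken from the slides:

```python
from collections import Counter, deque


class WindowedPairCounter:
    """Exact counts of (source, dest) pairs seen in the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.events = deque()    # (timestamp, pair) in arrival order
        self.counts = Counter()  # pair -> occurrences inside the window

    def add(self, ts, source, dest):
        # Assumes timestamps are non-decreasing.
        pair = (source, dest)
        self.events.append((ts, pair))
        self.counts[pair] += 1
        self._expire(ts)

    def _expire(self, now):
        # Keep only tuples inside the window (now - window, now].
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, k):
        return self.counts.most_common(k)


counter = WindowedPairCounter(window=3600)
for ts, src, dst in [(0, "a", "b"), (10, "a", "b"), (20, "c", "d"), (3605, "a", "b")]:
    counter.add(ts, src, dst)
top_pairs = counter.top(2)
```

The state of this exact version grows with the number of tuples in the window, which is the space limitation the slide points at.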

Motivation: Network Monitoring Queries

• Must process network streams in real time and in one pass (space)
• Critical NM tasks: fraud, DoS attacks, SLA violations (latency)
• Real-time traffic engineering to improve utilization
• Trade off result accuracy vs. space/time/communication
  • Fast responses, small space/time
  • Minimize use of communication resources

[Figure labels: IP Network, PSTN, DSL/Cable Networks, BGP]
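One way to picture the accuracy vs. space trade-off is a frequent-items summary with a fixed number of counters. The sketch below uses the Misra-Gries algorithm, a standard one-pass technique chosen here purely for illustration; it is not named in the slides, and the function name and sample stream are assumptions:

```python
def misra_gries(stream, k):
    """One-pass summary that keeps at most k - 1 counters.

    Each reported count underestimates the true count by at most n / k,
    where n is the stream length.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "b"]
summary = misra_gries(stream, k=3)
```

With only two counters the summary still retains the dominant item "a", at the price of an underestimated count.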

Example: ICU

H. C. M. Andrade, B. Gedik, and D. S. Turaga. "Fundamentals of Stream Processing", Cambridge University Press, 2014

Example: Cyber-Physical Systems (CPS)

http://www.kapsch.net/se/

Processing:
• On-the-fly
• distributed
• also parallel …

What is Data Streaming?

• Data stream processing
  • Alternative to store-and-process
  • Data processed in real time
  • Suitable for systems processing huge amounts of data
• Data streams
  • Flow of tuples, each containing application-related data
  • distributed, continuous, unbounded, rapid, time-varying, noisy, ...

Data Streaming: Requirements

• High throughput
• Low latency
• Determinism
  • Same output for same input, regardless of #cores

[Figure: chain of operators (Filter red, Count tuples, Alert if…) with example tuples <1,red>, <2,blue>, <3,red>, <2,red>]
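The operator chain in the figure can be sketched as three composed generators. This is a minimal single-threaded illustration; the alert condition is elided on the slide ("Alert if…"), so the `threshold` parameter and the alert text are assumptions, as are the function names:

```python
def filter_color(tuples, color):
    """Operator 1: keep only tuples whose second field equals `color`."""
    for ts, c in tuples:
        if c == color:
            yield ts, c


def count_tuples(tuples):
    """Operator 2: emit the running count after each incoming tuple."""
    count = 0
    for _, c in tuples:
        count += 1
        yield count, c


def alert_if(counts, threshold):
    """Operator 3 (hypothetical condition): alert once the count reaches `threshold`."""
    for count, c in counts:
        if count >= threshold:
            yield f"alert: {count} {c} tuples"


source = [(1, "red"), (2, "blue"), (3, "red")]
alerts = list(alert_if(count_tuples(filter_color(source, "red")), threshold=2))
```

Determinism means a parallel implementation of the same chain must produce exactly this output for this input, whatever the number of cores.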

What is Stream Aggregation?

• Data summarization
• General form:
  • select G, F1 from S where P group by G having F2
  • G: grouping attributes; F1, F2: aggregate expressions
• Window techniques are needed!
• Aggregate expressions:
  • distributive: sum, count, min, max
  • algebraic: avg
  • holistic: count-distinct, median

Low computation costs
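The three classes of aggregate expressions differ in what must be kept when a window is split across workers. A small Python sketch with made-up values: distributive aggregates combine partial results directly, algebraic ones combine a fixed-size partial state, and holistic ones need all the values.

```python
from statistics import median

# Two partitions of the same window, e.g. handled by different cores (sample data).
part_a = [1, 2, 9]
part_b = [3, 4, 5]

# Distributive: partial results combine directly.
total = sum(part_a) + sum(part_b)
largest = max(max(part_a), max(part_b))


# Algebraic: avg is derived from a fixed-size partial state (sum, count).
def partial_avg(values):
    return (sum(values), len(values))


def merge_avg(p, q):
    return (p[0] + q[0], p[1] + q[1])


s, n = merge_avg(partial_avg(part_a), partial_avg(part_b))
average = s / n

# Holistic: the median needs all values; combining per-partition medians is wrong.
true_median = median(part_a + part_b)
median_of_medians = median([median(part_a), median(part_b)])
```

Here `true_median` and `median_of_medians` differ, which is why holistic aggregates cannot be computed from small per-partition summaries.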

Multiway Stream Aggregation

• Multiple streams of incoming tuples
• Windows:
  • Time-based windows
  • Count-based windows
  • Sliding windows vs. tumbling windows
• 4 stages:
  • Add stage: fetch tuples from each input stream.
  • Merge stage: merge and sort fetched tuples according to timestamps.
  • Update stage: update the state of the windows a tuple contributes to.
  • Output stage: forward output tuples to the next aggregation stage.
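The difference between sliding and tumbling time-based windows comes down to how many windows one tuple contributes to. A minimal sketch, assuming integer timestamps, half-open windows [start, end) and window starts at multiples of `advance`; these conventions and the function name are assumptions, not taken from the slides:

```python
def windows_for(ts, size, advance):
    """Return the [start, end) windows that contain timestamp `ts`.

    advance == size gives tumbling windows; advance < size gives sliding ones.
    """
    first = max(0, (ts - size) // advance + 1) * advance  # earliest start covering ts
    last = (ts // advance) * advance                       # latest start covering ts
    return [(s, s + size) for s in range(first, last + 1, advance)]


sliding = windows_for(5, size=4, advance=2)
tumbling = windows_for(5, size=4, advance=4)
```

With a sliding window of size 4 and advance 2, the tuple at timestamp 5 updates two windows; with a tumbling window it updates exactly one.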

Multiway Streaming Aggregation

• Input: raw data converted to tuples and stored in queues.
• Output: a flow of tuples with the aggregated values.

[Figure sequence: tuples move from the queues of tuples through the add, sort, update and output stages; the update stage maintains the windows 1-4, 3-6 and 5-8 and their aggregated values]
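The four stages (add, merge, update, output) can be sketched end to end as below. This is a batch simplification under stated assumptions: each input queue is already sorted by timestamp, the aggregate is a sum, windows are half-open and start at multiples of `advance`, and all windows are emitted at the end rather than as they close. The function name and sample tuples are illustrative.

```python
import heapq
from collections import defaultdict


def multiway_aggregate(queues, size, advance):
    """Sum values per sliding window over several timestamp-sorted queues."""
    # Add stage: fetch the tuples currently available in each input queue.
    fetched = [list(q) for q in queues]
    # Merge stage: combine the per-stream runs into one timestamp-ordered run.
    merged = heapq.merge(*fetched, key=lambda t: t[0])
    # Update stage: add each tuple to every window it contributes to.
    windows = defaultdict(int)
    for ts, value in merged:
        first = max(0, (ts - size) // advance + 1) * advance
        for start in range(first, ts + 1, advance):
            windows[(start, start + size)] += value
    # Output stage: forward (window, aggregate) tuples in window order.
    return sorted(windows.items())


queues = [[(1, 10), (5, 1)], [(2, 20), (3, 5)]]
result = multiway_aggregate(queues, size=4, advance=2)
```

Merging by timestamp before updating is what keeps the result independent of how the tuples were distributed over the input queues.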

Why low-power embedded systems?

• Salient characteristics:
  • Heavy reliance on data transfers
  • Relatively low computations per byte
  • Relatively small amounts of data at a time
• Modern multi/many-core embedded systems:
  • Low-latency programmable local storage vs. caches
  • High-bandwidth access to main memory
  • VPU and ILP enabled
  • Ultra-low power

• Communication vs. computation costs,
• memory access patterns and
• granularity of data access patterns.

Data Streaming on embedded systems

Design challenges!

• How stream aggregation can map to the different parallel architectures is still an open problem
• Potential of such low-power processors for use in high-end computations
• Can high-performance computing techniques be deployed on these processors?
  • Addressing hardware constraints
  • Understanding memory access patterns in their algorithms in relation to the computation

Parallel Stream Aggregation

Concurrent data structures:
• Used between different stages of the aggregation process for communication purposes
• Share data across different threads/processes
• Allow for data parallelism
• Load balancing on the workload

• Tuples from each input stream are placed in queues by multiple threads
• A consumer thread performs the merge, update and output stages
• One final aggregator used

How? Synchronization

Concurrent data structures: Synchronization Techniques

• Coarse-grained locking
  • Easy but slow...
• Fine-grained locking
  • Fast/scalable but: error-prone, not composable, deadlocks
• Non-blocking
  • Based on atomic hardware primitives (e.g. TAS, CAS)
  • Good progress guarantees (lock/wait-freedom)
  • Scalable

Fig. Yiannis Nikolakopoulos

Concurrent data structures: Queue Buffers

• Single Producer Single Consumer (SPSC)
  • Lamport 1983: Lamport Queue
  • Giacomoni et al. 2008: FastForward Queue
  • Lee et al. 2009: MCRingBuffer
  • Preud'homme et al. 2010: BatchQueue
• Multi Producer Multi Consumer (MPMC)
  • Michael & Scott 1996: MS-Queue (1-lock, 2-lock)
  • Mellor-Crummey 2016: Fetch-and-Add Queue
  • Message-passing based queues
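The index discipline behind a Lamport-style SPSC queue can be sketched in a few lines: only the producer advances `tail`, only the consumer advances `head`, so no lock is taken. This Python version is a structural illustration only. It relies on CPython's interpreter behavior instead of explicit memory ordering, which a C implementation on real hardware would have to handle, and it uses `None` to signal "empty", so `None` cannot be stored. Class and function names are assumptions.

```python
import threading
import time


class LamportQueue:
    """Bounded single-producer single-consumer ring buffer."""

    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)  # one slot stays empty to tell full from empty
        self.head = 0  # next slot to read, written only by the consumer
        self.tail = 0  # next slot to write, written only by the producer

    def enqueue(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False  # full
        self.buf[self.tail] = item
        self.tail = nxt  # publish only after the slot is written
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None  # empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item


def producer(q, n):
    for i in range(n):
        while not q.enqueue(i):
            time.sleep(0)  # queue full: yield and retry


def consumer(q, n, out):
    while len(out) < n:
        item = q.dequeue()
        if item is None:
            time.sleep(0)  # queue empty: yield and retry
        else:
            out.append(item)


q, received = LamportQueue(capacity=4), []
threads = [
    threading.Thread(target=producer, args=(q, 100)),
    threading.Thread(target=consumer, args=(q, 100, received)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After both threads finish, `received` holds the items in the order they were enqueued.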

Target Architecture

Myriad1 architecture highlights

[Block diagram labels: DDR controller; 128 kB 2-way L2 cache (SHAVE); 32 kB LRAM; LEON3 RISC with 4 kB 2-way I-cache and 4 kB 2-way D-cache; x 8 SHAVEs (SHAVE VLIW vector processor: VRF 32x128, IRF 32x32, SRF 32x32 (12 ports), port counts 12 and 17, DCU, IDC, 1 kB D-cache, PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU); 128-bit CMX instruction port; two 64-bit CMX ports; 32-bit APB; 128-bit AXI; 128-bit AHB; 1 MB CMX SRAM]

• 65 nm ultra-low-power architecture (≤ 0.35 W @ 180 MHz) with 11 power islands.
• Hardware support for SIMD, matrix transpose, sparse data, sqrt@fp16, predicated execution, ...
• Heterogeneous SoC: 1 LEON3 @ fp64 + 8 SHAVEs @ fp32.
• 32 KB LRAM, 1 MB CMX, 16/64 MB DDR, DMAs.
• Power efficiency of 1 Tops/W (max 8-bit equivalent).
• FIFO buffers

Myriad2 architecture highlights

[Block diagram labels: 128/256 MB LPDDR2/3 stacked die; DDR controller; 256 kB 2-way L2 cache (SHAVE); 2 MB CMX SRAM; LEON4 RISC2 with 256 kB 4-way L2 cache, 32 kB 2-way I-cache and 32 kB 2-way D-cache; LEON4 RISC1 with 32 kB 4-way L2 cache, 4 kB 2-way I-cache and 4 kB 2-way D-cache; x 12 SHAVEs (SHAVE VLIW vector processor: VRF 32x128, IRF 32x32, port counts 10 and 17, DCU, IDC, 1 kB D-cache, 1 kB I-cache, PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU); 128-bit ports; two 64-bit CMX ports; 32-bit APB; 128-bit AXI; two 128-bit AHB]

• 28 nm ultra-low-power architecture (≤ 0.5 W @ 600 MHz) with 17 power islands.
• Extended hardware support over Myriad1: clock gating, hard-wired configurable accelerators for imaging and vision, etc.
• Heterogeneous SoC: 2 LEON4 @ fp64 + 12 SHAVEs @ fp32.
• 256+32 KB LRAM, 2 MB CMX, DDR3 support, DMAs.
• Power efficiency of 2 Tops/W (max 16-bit equivalent).
• FIFO buffers

Evaluation of Streaming Aggregation Operator in Low Power Embedded Systems

Single producer variation

• One producer feeding tuples to ten aggregators in a round
• Three producers feeding tuples to 8 aggregators
• One final aggregator used
• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators

Three producers variation

• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators
• Three producers feeding tuples to 8 aggregators
• One final aggregator used

Streaming Aggregation Operator Customization In Embedded Systems

Streaming aggregation design space

• Category A: decision trees that refer to memory configuration and allocation
• Category B: decision trees related to data movement and the means by which accesses to shared resources are synchronized

METHODOLOGY

Input:
• Application constraints
• Hardware constraints

STEP 1: Design space exploration
• Remove non-applicable options from the design space
• Exploration for all customized streaming aggregation implementations
• Step 1 output: throughput, latency, energy, scalability for each customized implementation

STEP 2: Identification of Pareto-efficient implementations
• Throughput vs. memory size
• Latency vs. energy consumption
• Scalability

Methodology output: customized streaming aggregation implementation

EXAMPLE

Implementations evaluated:
• Q160: A1(loc), A2(loc), …, B4(b.w.)
• Q160: A1(loc), A2(loc), …, B4(p.s.)
• Q320: A1(loc), A2(loc), …, B4(b.w.)
• Q320: A1(loc), A2(loc), …, B4(p.s.)
• Q640: A1(loc), A2(loc), …, B4(p.s.)
• ......

Metric1: 7.4164, 7.4159, 7.3562, 7.3567, 7.3365, 7.3336
Metric2: ...

[Plot: Metric1 vs. Metric2 with the Pareto points P1, P2, P3, P4; Metric1 axis from 7.25 to 7.45, Metric2 axis from 40 to 100]
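Step 2 of the methodology keeps only the Pareto-efficient implementations for a pair of metrics. A minimal sketch for the throughput vs. memory size case, where higher throughput and lower memory are better; the implementation names and measurement values below are hypothetical and only illustrate the selection rule:

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on both metrics and better on one.

    Each point is (throughput, memory): higher throughput and lower memory win.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    better = a[0] > b[0] or a[1] < b[1]
    return no_worse and better


def pareto_front(impls):
    """Return the names of implementations not dominated by any other."""
    return [
        name
        for name, p in impls.items()
        if not any(dominates(q, p) for other, q in impls.items() if other != name)
    ]


# Hypothetical (throughput, memory) measurements, for illustration only.
impls = {"A": (7.42, 100), "B": (7.36, 60), "C": (7.34, 40), "D": (7.30, 80)}
front = pareto_front(impls)
```

Implementation "D" is dropped because "B" offers more throughput with less memory; the remaining three are trade-offs that the designer chooses between.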

Evaluation setup

• Dataset: SoundCloud (user id, timestamp, song id, comment)
• Query: user id with the highest number of comments.
• Platforms: Myriad1 (8 cores), Myriad2 (12 cores).
• Evaluation metrics: throughput, memory size, latency, energy consumption

Multiway streaming aggregation results: throughput, latency, energy and memory

Performance per watt

Platform        Latency (µs)   Throughput (t/s)   (t/s)/W
Myriad1         140.38         123,622            379,041
Myriad2         39.8           497,154            1,004,766
Intel Xeon E5   15             1,105,221          18,412

• x20 higher performance per watt on Myriad1 than on the Intel Xeon E5
• x54 higher performance per watt on Myriad2 than on the Intel Xeon E5
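The x20 and x54 figures can be checked directly from the table: dividing each platform's (t/s)/W by the Xeon's gives the efficiency ratio, and dividing throughput by (t/s)/W gives the power implied by the two columns. A short Python check using only the numbers from the table (the variable names are illustrative):

```python
# (throughput in tuples/s, throughput per watt) as listed in the table.
table = {
    "Myriad1": (123_622, 379_041),
    "Myriad2": (497_154, 1_004_766),
    "Intel Xeon E5": (1_105_221, 18_412),
}

baseline = table["Intel Xeon E5"][1]
ratios = {}         # efficiency relative to the Xeon
implied_power = {}  # watts implied by throughput / (throughput per watt)
for name, (throughput, per_watt) in table.items():
    ratios[name] = per_watt / baseline
    implied_power[name] = throughput / per_watt
```

The ratios come out at roughly 20.6 for Myriad1 and 54.6 for Myriad2, which the slide reports truncated as x20 and x54. The implied power is about 0.33 W for Myriad1 and 0.49 W for Myriad2, in line with the ≤ 0.35 W and ≤ 0.5 W figures on the target architecture slide, against about 60 W for the Xeon.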

Conclusions

• Designed efficient concurrent data structure implementations for embedded system applications.
• Evaluation of a concurrent data structure implementation model based on message passing.
• Design space exploration of streaming aggregation implementation on embedded architectures.
• Data streaming: a major departure from the traditional persistent database paradigm
• Fundamental re-thinking of models, assumptions, algorithms, system architectures, …

References

1. Lamport, L.: Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems 5 (1983) 190-222

2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM (2008) 43-52

3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: fast and memory-thrifty core to core communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). (2010) 215-222

4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A lock-free, cache-efficient shared ring buffer for multi-core architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ANCS '09, New York, NY, USA, ACM (2009) 78-79

5. Michael, M., Scott, M.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, ACM (1996) 267-275

6. Yang, C., Mellor-Crummey, J.: A wait-free queue as fast as fetch-and-add. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP (2016) 16:1-16:13

1: Data are fetched from off-chip DDR into CMX using DMA (L2 cache is enabled in normal mode).
2: Tuples are copied from one CMX slice to another CMX slice using memcpy or DMA.
3: 4-byte pointers are transferred constantly from one SHAVE to another for determining which windows should be removed.


Queues

[Plot: throughput (Mops/s) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12) for the 1-lock, 2-lock and FAAQueue implementations]

[Plot: power (mW) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12) for the 1-lock, 2-lock and FAAQueue implementations]

[Plot: energy per operation (mJ/op) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12) for the 1-lock, 2-lock and FAAQueue implementations]
