
Energy Efficient Data Stream Processing on Ultra-Low-Power Embedded Multicore Devices

Ivan Walulya, Chalmers University of Technology

This project is part of the portfolio of the A.3 – Advanced Computing and Complex System Unit, Communications Networks, Content and Technology DG, European Commission.

Contract Number: 611183
Total Cost [€]: 3.31 million
Starting Date: 2013-09-01
Duration: 36 months

New World Order

I. Walulya @ CTH 2

• Traditional DBMS: data stored in finite, persistent data sets
• Data is continuously growing faster than our ability to store or index it
• Data Streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
• Data-Stream Sources:
  • Network monitoring and traffic engineering
  • Sensor networks
  • Telecom call-detail records
  • Financial applications
  • Manufacturing processes
  • Web logs and clickstreams
  • Others…

Real-time Stream Processing

Motivation: Network Monitoring Queries

[Figure: DSL/Cable Networks and Enterprise Networks feed a Network Operations Center (NOC), backed by a back-end data warehouse running a DBMS (Oracle, DB2). Off-line analysis is slow and expensive.]

Example queries at the NOC:
• What are the top (most frequent) 1000 (source, dest) pairs seen over the last 1 hour?
• SQL join query:
    SELECT COUNT (R1.source, R2.dest)
    FROM R1, R2
    WHERE R1.dest = R2.source
• Set-expression query: How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?

• Store-then-process is not feasible!
• Extra complexity comes from limited space and time

Motivation: Network Monitoring Queries

• Must process network streams in real time and in one pass (space)
• Critical NM tasks: fraud, DoS attacks, SLA violations (latency)
• Real-time traffic engineering to improve utilization
• Trade off result accuracy vs. space/time/communication
  • Fast responses, small space/time
  • Minimize use of communication resources

[Figure labels: IP Network, DSL/Cable Networks]
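To make the accuracy vs. space trade-off concrete, here is a minimal sketch of a one-pass frequent-item summary. It uses the Misra-Gries algorithm, which is a stand-in chosen for illustration and not a technique named on the slides. The function name, the parameter `k` and the sample (source, dest) pairs are all hypothetical.

```python
def misra_gries(stream, k):
    """One-pass summary of frequent items using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    survive; the stored counts are lower bounds on the true counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical (source, dest) pairs: one heavy pair plus background traffic.
heavy = ("10.0.0.1", "10.0.0.2")
stream = [heavy] * 6 + [("10.0.0.3", "10.0.0.4")] * 3 + [(f"h{i}", "x") for i in range(3)]
summary = misra_gries(stream, k=3)
```

The summary never holds more than `k - 1` entries regardless of stream length, at the price of approximate counts.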

Example: ICU

H. C. M. Andrade, B. Gedik, and D. S. Turaga. "Fundamentals of Stream Processing", Cambridge University Press, 2014

Example: Cyber-Physical Systems (CPS)

http://www.kapsch.net/se/

Processing:
• On-the-fly
• Distributed
• Also parallel…

What is Data Streaming?

• Data Stream Processing
  • Alternative to store-and-process
  • Data processed in real time
  • Suitable for systems processing huge amounts of data
• Data Streams
  • Flow of tuples, each containing application-related data
  • Distributed, continuous, unbounded, rapid, time-varying, noisy, ...

Data Streaming: Requirements

• High throughput
• Low latency
• Determinism
  • Same output for same input, regardless of #cores

[Figure: a pipeline of operators (Filter red → Count tuples → Alert if…) applied to the tuples <1,red>, <2,blue>, <3,red>, with <2,red> shown further along the pipeline.]
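The operator chain in the figure can be sketched with Python generators. This is only an illustration: the alert condition is elided on the slide ("Alert if…"), so the `threshold` parameter below is an assumption, as are the function names.

```python
def filter_red(tuples):
    """Operator 1: keep only the tuples whose colour is red."""
    for ts, colour in tuples:
        if colour == "red":
            yield ts, colour


def count_tuples(tuples):
    """Operator 2: emit the running count of tuples seen so far."""
    count = 0
    for _, colour in tuples:
        count += 1
        yield count, colour


def alert_if(counts, threshold):
    """Operator 3: raise an alert once the count reaches a (hypothetical) threshold."""
    for count, colour in counts:
        if count >= threshold:
            yield f"alert: {count} {colour} tuples"


stream = [(1, "red"), (2, "blue"), (3, "red")]
alerts = list(alert_if(count_tuples(filter_red(stream)), threshold=2))
```

Determinism in the sense of the slide means `alerts` must come out the same whether the operators run on one core or are spread over many.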

What is Stream Aggregation?

• Data summarization
• General form:
  • select G, F1 from S where P group by G having F2
  • G: grouping attributes; F1, F2: aggregate expressions
• Window techniques are needed!
• Aggregate expressions:
  • distributive: sum, count, min, max
  • algebraic: avg
  • holistic: count-distinct, median

Low Computation Costs
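The three classes of aggregate expressions differ in what must be kept per partition. A small sketch with made-up values (the partitions and helper names are hypothetical): a distributive aggregate is combined from partial results of the same kind, an algebraic one from a fixed-size partial state, and a holistic one cannot be derived from per-partition results.

```python
from statistics import median

part1, part2 = [1, 2, 9], [3, 4]   # two hypothetical partitions of one window
everything = part1 + part2

# Distributive: the sum of the partial sums equals the overall sum.
total = sum(part1) + sum(part2)

# Algebraic: avg is derived from a constant-size partial state (sum, count).
def avg_state(values):
    return (sum(values), len(values))

def merge_states(a, b):
    return (a[0] + b[0], a[1] + b[1])

merged = merge_states(avg_state(part1), avg_state(part2))
average = merged[0] / merged[1]

# Holistic: the median of partial medians is generally not the true median.
median_of_medians = median([median(part1), median(part2)])
true_median = median(everything)
```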

Multiway Stream Aggregation

• Multiple streams of incoming tuples
• Windows:
  • Time-based windows
  • Count-based windows
  • Sliding windows vs. tumbling windows
• 4 stages:
  • Add stage: fetch tuples from each input stream.
  • Merge stage: merge and sort fetched tuples according to timestamps.
  • Update stage: update the state of the windows a tuple contributes to.
  • Output stage: forward output tuples to the next aggregation stage.
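The four stages can be sketched in a few lines. This is a single-threaded illustration under stated assumptions, not the implementation evaluated in the talk: timestamps are integers, each input stream is already sorted by timestamp, windows are time-based and sliding, and the aggregate is a sum. The function name and the `size`/`advance` parameters are hypothetical.

```python
import heapq
from collections import defaultdict


def multiway_aggregate(streams, size, advance):
    """Sum values per sliding window [start, start + size) across several streams."""
    windows = defaultdict(int)   # window start -> running sum
    results = []
    # Add + merge stages: fetch tuples from all streams in timestamp order.
    for ts, value in heapq.merge(*streams):
        # Output stage: a window ending at or before ts can no longer change.
        for start in sorted(w for w in windows if w + size <= ts):
            results.append((start, start + size, windows.pop(start)))
        # Update stage: add the value to every window that contains ts.
        first = max(0, ((ts - size) // advance + 1) * advance)
        for start in range(first, ts + 1, advance):
            windows[start] += value
    # End of (finite) input: flush the windows that are still open.
    for start in sorted(windows):
        results.append((start, start + size, windows.pop(start)))
    return results


stream_a = [(1, 10), (3, 20), (7, 5)]   # (timestamp, value), made-up data
stream_b = [(2, 1), (4, 2), (6, 3)]
out = multiway_aggregate([stream_a, stream_b], size=4, advance=2)
```

With `size=4` and `advance=2` every tuple contributes to up to two overlapping windows; setting `advance` equal to `size` gives tumbling windows.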

Multiway Streaming Aggregation

[Figure (animation over several slides): queues of tuples from the input streams are sorted by timestamp, each tuple updates the windows it falls into (window labels 1-4, 3-6, 5-8), and aggregated values are output. Stage labels: sort, update, output.]

• Input: Raw data converted to tuples and stored in queues.
• Output: A flow of tuples with the aggregated values.

Why low-power embedded systems?

• Salient characteristics:
  • Heavy reliance on data transfers
  • Relatively low computation per byte
  • Relatively small amounts of data at a time
• Modern multi/many-core embedded systems:
  • Low-latency programmable local storage vs. caches
  • High-bandwidth access to main memory
  • VPU and ILP enabled
  • Ultra-low power

• Communication vs. computation costs,
• memory access patterns, and
• granularity of data access patterns.

Data Streaming on embedded systems

Design challenges!

• How stream aggregation can map to the different parallel architectures is still an open problem
• Potential of such low-power processors for use in high-end computations
• Can high-performance computing techniques be deployed on these processors?
• Addressing hardware constraints
• Understanding memory access patterns in the algorithms in relation to the computation

Parallel Stream Aggregation

Concurrent Data Structures:
• Used between different stages of the aggregation process for communication purposes
• Share data across different threads/processes
• Allow for data parallelism
• Load balancing on the workload

• Tuples from each input stream are placed in queues by multiple threads
• A consumer thread performs the merge, update and output stages
• One final aggregator used

How? Synchronization

Concurrent data structures: Synchronization Techniques

• Coarse-grained locking
  • Easy but slow...
• Fine-grained locking
  • Fast/scalable but: error-prone, not composable, deadlocks
• Non-blocking
  • Based on atomic hardware primitives (e.g. TAS, CAS)
  • Good progress guarantees (lock/wait-freedom)
  • Scalable

Fig. Yiannis Nikolakopoulos

Concurrent data structures: Queue Buffers

• Single Producer Single Consumer (SPSC)
  • Lamport 1983: Lamport Queue
  • Giacomoni et al. 2008: FastForward Queue
  • Lee et al. 2009: MCRingBuffer
  • Preud'homme et al. 2010: BatchQueue
• Multi Producer Multi Consumer (MPMC)
  • Michael & Scott 1996: MS-Queue (1-lock, 2-lock)
  • Yang & Mellor-Crummey 2016: Fetch-and-Add Queue
  • Message-passing based queues
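As an illustration of the SPSC family, here is a minimal ring-buffer sketch in the style of the Lamport queue. It only shows the index discipline (the producer alone writes the tail, the consumer alone writes the head); a real implementation on an embedded target would also need the appropriate memory ordering, which plain Python does not express. Class and method names are hypothetical.

```python
class SpscRingBuffer:
    """Bounded single-producer/single-consumer queue (Lamport-style sketch)."""

    def __init__(self, capacity):
        # One slot is left unused so that "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0   # next slot to read; written only by the consumer
        self._tail = 0   # next slot to write; written only by the producer

    def try_enqueue(self, item):
        """Producer side: return False instead of blocking when the queue is full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = nxt   # publish the slot only after it has been written
        return True

    def try_dequeue(self):
        """Consumer side: return (False, None) instead of blocking when empty."""
        if self._head == self._tail:
            return (False, None)
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        return (True, item)


queue = SpscRingBuffer(capacity=2)
accepted = [queue.try_enqueue(t) for t in ("a", "b", "c")]
first = queue.try_dequeue()
```

Because each index has exactly one writer, no lock or compare-and-swap is needed in the SPSC case; the MPMC queues listed above need stronger primitives.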

Target Architecture

Myriad1 architecture highlights / Myriad2 architecture highlights

[Figure: Myriad1 block diagram. LEON3 RISC with 4kB 2-way I-cache and 4kB 2-way D-cache; 32kB LRAM; 8 SHAVE VLIW vector processors (units PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU; register files VRF 32x128 and IRF 32x32, labelled 12 ports and 17 ports; SRF 32x32 (12 ports); 1kB D-cache); 1MB CMX SRAM; 128kB 2-way L2 cache (SHAVE); DDR controller. Interconnect labels: 128-bit CMX instruction port, 64-bit CMX port, 32-bit APB, 128-bit AXI, 128-bit AHB.]

[Figure: Myriad2 block diagram. Two LEON4 RISC cores (LEON4 RISC1, LEON4 RISC2), one with 256kB 4-way L2 cache, 32kB 2-way I-cache and 32kB 2-way D-cache, the other with 32kB 4-way L2 cache, 4kB 2-way I-cache and 4kB 2-way D-cache; 12 SHAVE VLIW vector processors (units PEU, BRU, VAU, IAU, LSU0, LSU1, SAU, CMU; register files VRF 32x128 and IRF 32x32, labelled 10 ports and 17 ports; 1kB D-cache, 1kB I-cache); 2MB CMX SRAM; 256kB 2-way L2 cache (SHAVE); DDR controller; 128/256MB LPDDR2/3 stacked die. Interconnect labels: 128-bit ports, two 64-bit CMX ports, 32-bit, 128-bit AXI, two 128-bit AHB.]

Myriad1:
• 65nm ultra-low-power architecture (≤ 0.35W @ 180MHz) with 11 power islands.
• Hardware support for SIMD, matrix transpose, sparse data, sqrt@fp16, predicated execution...
• Heterogeneous SoC: 1 Leon3@fp64 + 8 Shaves@fp32.
• 32KB LRAM, 1MB CMX, 16/64MB DDR, DMAs.
• Power efficiency of 1 Tops/W (max 8-bit equivalent).
• FIFO buffers

Myriad2:
• 28nm ultra-low-power architecture (≤ 0.5W @ 600MHz) with 17 power islands.
• Extended hardware support over Myriad 1: clock-gating, hard-wired configurable accelerators for imaging and vision, etc.
• Heterogeneous SoC: 2 Leon4@fp64 + 12 Shaves@fp32.
• 256+32KB LRAM, 2MB CMX, DDR3 support, DMAs.
• Power efficiency of 2 Tops/W (max 16-bit equivalent).
• FIFO buffers

Evaluation of Streaming Aggregation Operator in Low Power Embedded Systems

Single-producer variation
• One producer feeding tuples to ten aggregators in a round-robin fashion
• One final aggregator used
• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators

Three-producer variation
• Three producers feeding tuples to 8 aggregators
• One final aggregator used
• All processes run on SHAVEs
• Queues placed in CMX slices of aggregators
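The producer/aggregator/final-aggregator layout can be sketched sequentially as follows. This is only a functional model under assumptions: it runs on one thread rather than on SHAVEs, uses Python lists in place of queues in CMX slices, and counts tuples per key as the aggregate. The function name and sample tuples are hypothetical.

```python
from collections import Counter
from itertools import cycle


def run_pipeline(tuples, num_aggregators):
    """Round-robin producer, per-aggregator partial counts, one final merge."""
    queues = [[] for _ in range(num_aggregators)]
    # Producer: hand tuples to the aggregator queues in round-robin order.
    for queue, tup in zip(cycle(queues), tuples):
        queue.append(tup)
    # Aggregators: each one counts tuples per key in its own queue.
    partials = [Counter(key for key, _ in queue) for queue in queues]
    # Final aggregator: merge the partial counts into the overall result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return queues, total


sample = [("u1", 1), ("u2", 2), ("u1", 3), ("u3", 4), ("u1", 5)]
queues, total = run_pipeline(sample, num_aggregators=2)
top_key, top_count = total.most_common(1)[0]
```

Round-robin distribution balances load but spreads tuples with the same key over several aggregators, which is why a final aggregator is needed to combine the partial results.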

Streaming Aggregation Operator Customization In Embedded Systems

Streaming aggregation design space

• Category A:
  • consists of decision trees that refer to memory configuration and allocation
• Category B:
  • consists of decision trees related to data movement and the means by which accesses to shared resources are synchronized

METHODOLOGY

Input:
• Application constraints
• Hardware constraints

STEP 1: Design space exploration
• Remove non-applicable options from the design space
• Exploration for all customized streaming aggregation implementations
• Step 1 output: throughput, latency, energy, scalability for each customized implementation

STEP 2: Identification of Pareto-efficient implementations
• Throughput vs. memory size
• Latency vs. energy consumption
• Scalability

Methodology output: customized streaming aggregation implementation

EXAMPLE

Implementations evaluated:
• Q160: A1(loc), A2(loc), …, B4(b.w.)
• Q160: A1(loc), A2(loc), …, B4(p.s.)
• Q320: A1(loc), A2(loc), …, B4(b.w.)
• Q320: A1(loc), A2(loc), …, B4(p.s.)
• Q640: A1(loc), A2(loc), …, B4(p.s.)
• ......

[Table residue: columns Metric1, Metric2, ...; Metric1 values 7.4164, 7.4159, 7.3562, 7.3567, 7.3365, 7.3336.]
[Plot: Metric1 vs. Metric2.]
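Step 2 amounts to filtering out dominated implementations. A minimal sketch for the throughput vs. memory size trade-off, assuming higher throughput and lower memory are better; the configuration names and numbers are invented for illustration and are not results from the talk.

```python
def pareto_front(points):
    """Return the names of the points that no other point dominates.

    Each point is (name, throughput, memory). A point is dominated if some
    other point is at least as good on both metrics and strictly better on one.
    """
    front = []
    for name, thr, mem in points:
        dominated = any(
            (t >= thr and m <= mem) and (t > thr or m < mem)
            for _, t, m in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical measurements: (implementation, throughput, memory size).
candidates = [
    ("cfg_a", 100, 40),
    ("cfg_b", 120, 60),
    ("cfg_c", 90, 50),    # worse than cfg_a on both metrics
    ("cfg_d", 120, 80),   # same throughput as cfg_b, more memory
]
efficient = pareto_front(candidates)
```

The same filter applies to latency vs. energy consumption by flipping the comparison so that lower is better on both axes.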

Evaluation setup

• Dataset: Soundcloud (user id, timestamp, song id, comment)
• Query: user id with the highest number of comments.
• Platforms: Myriad1 (8 cores), Myriad2 (12 cores).
• Evaluation metrics: throughput, memory size, latency, energy consumption

Multiway streaming aggregation results: throughput, latency, energy and memory

Performance per watt

Platform         Latency (usec)   Throughput (t/sec)   (t/sec)/watt
Myriad1          140.38           123,622              379,041
Myriad2          39.8             497,154              1,004,766
Intel Xeon E5    15               1,105,221            18,412

• Myriad1: about x20 higher performance per watt than the Intel Xeon E5
• Myriad2: about x54 higher performance per watt than the Intel Xeon E5
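The x20 and x54 factors follow directly from the last column of the table. The short check below also derives the power draw implied by dividing throughput by (t/sec)/watt; that implied figure is an inference from the table, not a number stated on the slide.

```python
# (throughput in tuples/sec, tuples/sec per watt), copied from the table above.
rows = {
    "Myriad1": (123_622, 379_041),
    "Myriad2": (497_154, 1_004_766),
    "Intel Xeon E5": (1_105_221, 18_412),
}


def summarize(rows, baseline):
    """Efficiency ratio against the baseline and implied power per platform."""
    base_eff = rows[baseline][1]
    return {
        name: {"ratio": eff / base_eff, "implied_watts": thr / eff}
        for name, (thr, eff) in rows.items()
    }


summary = summarize(rows, baseline="Intel Xeon E5")
for name, s in summary.items():
    print(f"{name}: x{s['ratio']:.1f} per watt, ~{s['implied_watts']:.2f} W implied")
```

The implied power for the two Myriad chips stays below the ≤ 0.35W and ≤ 0.5W envelopes quoted in the architecture highlights.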

Conclusions

• Designed efficient concurrent data structure implementations for embedded system applications.
• Evaluation of a concurrent data structure implementation model based on message-passing.
• Design space exploration of streaming aggregation implementations on embedded architectures.
• Data Streaming: major departure from the traditional persistent database paradigm
• Fundamental re-thinking of models, assumptions, algorithms, system architectures, …

References

1. Lamport, L.: Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems 5(2) (1983) 190-222

2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM (2008) 43-52

3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: fast and memory-thrifty core to core communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). (2010) 215-222

4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A lock-free, cache-efficient shared ring buffer for multi-core architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ANCS '09, New York, NY, USA, ACM (2009) 78-79

5. Michael, M., Scott, M.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, ACM (1996) 267-275

6. Yang, C., Mellor-Crummey, J.: A wait-free queue as fast as fetch-and-add. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP (2016) 16:1-16:13

Three-producers variation (figure annotations):
1: Data are fetched from off-chip DDR into CMX using DMA (L2 cache is enabled in normal mode).
2: Tuples are copied from one CMX slice to another CMX slice using memcpy or DMA.
3: 4-byte pointers are transferred constantly from one SHAVE to another for determining which windows should be removed.

[Figures: three plots titled "Queues" with the number of SHAVEs (1, 2, 4, 6, 8, 10, 12) on the x-axis; y-axes include Throughput (Mops/s) and Power (mW); series: 1-lock, 2-lock, FAAQueue.]
