![Slide 1](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/1.jpg)
Energy Efficient Data Stream Processing on Ultra-Low-Power Embedded Multicore Devices
Ivan Walulya, Chalmers University of Technology
This project is part of the portfolio of the A.3 – Advanced Computing and Complex System Unit, Communications Networks, Content and Technology DG, European Commission
www.excess-project.eu
Copyright © 2013 - 2016 The EXCESS Consortium
Contract Number: 611183
Total Cost [€]: 3.31 million
Starting Date: 2013-09-01
Duration: 36 months
![Slide 2](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/2.jpg)
New World Order
I. Walulya @ CTH 2
- Traditional DBMS: data stored in finite, persistent data sets
- Data is growing continuously, faster than our ability to store or index it
- Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
- Data-stream sources:
  - Network monitoring and traffic engineering
  - Sensor networks
  - Telecom call-detail records
  - Financial applications
  - Manufacturing processes
  - Web logs and clickstreams
  - Others
![Slide 3](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/3.jpg)
Real-time Stream Processing
Motivation: Network Monitoring Queries
Figure: DSL/cable networks, enterprise networks, peer networks, and the PSTN, with routers R1, R2, and R3 and a Network Operations Center (NOC); the traditional back-end data warehouse (a DBMS such as Oracle or DB2) allows only off-line analysis, which is slow and expensive.
- What are the top (most frequent) 1000 (source, dest) pairs seen over the last 1 hour?
- SQL join query: `SELECT COUNT(*) FROM R1, R2 WHERE R1.dest = R2.source`
- Set-expression query: How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?
- Store-then-process is not feasible!
- Extra complexity comes from limited space and time.
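The top-1000 question above must be answered in one pass and in limited space. The slides do not name an algorithm for it; as one well-known option (swapped in here, not taken from the talk), the Misra-Gries summary keeps at most k counters, and each surviving count underestimates the true count by at most n/(k+1), where n is the number of tuples seen. This minimal sketch assumes the (source, dest) pairs arrive as a Python iterable and ignores the one-hour window (the summary could simply be reset per window); `misra_gries` is a hypothetical name.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """One-pass frequent-items summary that never holds more than k counters.

    Each returned count underestimates the item's true count by at most
    n / (k + 1), where n is the number of items consumed.
    """
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement all of them (the new item is
            # implicitly cancelled out) and drop counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


pairs = [("a", "b")] * 5 + [("c", "d")] * 3 + [("e", "f"), ("g", "h")]
summary = misra_gries(pairs, k=2)
print(summary)  # {('a', 'b'): 3, ('c', 'd'): 1}
```

For a top-1000 query one would use k of at least 1000 and sort the surviving counters; every pair whose true frequency exceeds n/(k+1) is guaranteed to survive.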
![Slide 4](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/4.jpg)
Motivation: Network Monitoring Queries
- Must process network streams in real time and in one pass (space)
- Critical network-monitoring tasks: fraud, DoS attacks, SLA violations (latency)
- Real-time traffic engineering to improve utilization
- Trade off result accuracy against space, time, and communication
- Fast responses with small space and time
- Minimize use of communication resources
Figure: IP network, PSTN, and DSL/cable networks (with BGP) reporting to a Network Operations Center (NOC).
![Slide 5](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/5.jpg)
Example: ICU
H. C. M. Andrade, B. Gedik, and D. S. Turaga. "Fundamentals of Stream Processing." Cambridge University Press, 2014.
![Slide 6](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/6.jpg)
Example: Cyber-Physical Systems (CPS)
http://www.kapsch.net/se/
Processing:
- On-the-fly
- Distributed
- Also parallel...
![Slide 7](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/7.jpg)
What is Data Streaming?
- Data stream processing
  - Alternative to store-and-process
  - Data processed in real time
  - Suitable for systems processing huge amounts of data
- Data streams
  - Flow of tuples, each containing application-related data
  - Distributed, continuous, unbounded, rapid, time-varying, noisy, ...
![Slide 8](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/8.jpg)
Data Streaming: Requirements
- High throughput
- Low latency
- Determinism
  - Same output for the same input, regardless of the number of cores
Figure: tuples <2,blue>, <1,red>, <3,red>, <2,red> flow through a chain of operators: "Filter red" → "Count tuples" → "Alert if…".
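The operator chain in the figure can be sketched with Python generators, each operator consuming the previous operator's output one tuple at a time. This is a minimal illustration, not the talk's implementation: `StreamTuple`, `filter_red`, and `count_tuples` are hypothetical names, and because the alert condition is elided on the slide, the sketch stops at counting.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class StreamTuple:
    value: int
    colour: str


def filter_red(stream: Iterable[StreamTuple]) -> Iterator[StreamTuple]:
    """First operator: forward only the tuples whose colour is red."""
    for t in stream:
        if t.colour == "red":
            yield t


def count_tuples(stream: Iterable[StreamTuple]) -> Iterator[int]:
    """Second operator: emit the running count after each tuple it receives."""
    count = 0
    for _ in stream:
        count += 1
        yield count


source = [StreamTuple(2, "blue"), StreamTuple(1, "red"),
          StreamTuple(3, "red"), StreamTuple(2, "red")]
print(list(count_tuples(filter_red(source))))  # [1, 2, 3]
```

Each operator here is a pure function of its input sequence, so rerunning the chain on the same input gives the same output. That is the determinism requirement; a parallel implementation has to preserve it even when tuples are processed on different cores.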
![Slide 9](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/9.jpg)
What is Stream Aggregation?
- Data summarization
- General form:
  - `select G, F1 from S where P group by G having F2`
  - G: grouping attributes; F1, F2: aggregate expressions
- Window techniques are needed!
- Aggregate expressions:
  - distributive: sum, count, min, max
  - algebraic: avg
  - holistic: count-distinct, median
Figure annotation: low computation costs.
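The three aggregate classes differ in cost once a stream is split across several aggregators. A minimal sketch with made-up values: distributive aggregates (sum, count, min, max) merge directly from per-partition results, an algebraic aggregate (avg) is derived from a fixed number of distributive ones, and a holistic aggregate (median) cannot be recovered from bounded per-partition summaries. The function names `partial` and `combine` are hypothetical.

```python
from statistics import median
from typing import Dict, List


def partial(values: List[int]) -> Dict[str, int]:
    """Per-partition summary made only of distributive aggregates."""
    return {"sum": sum(values), "count": len(values),
            "min": min(values), "max": max(values)}


def combine(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Merge two summaries without looking at the raw tuples again."""
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}


p1, p2 = [4, 8, 1], [7, 3]
merged = combine(partial(p1), partial(p2))
avg = merged["sum"] / merged["count"]   # algebraic: derived from two distributive values

print(merged, avg)                       # {'sum': 23, 'count': 5, 'min': 1, 'max': 8} 4.6
# Holistic: the median of the partition medians is not the overall median.
print(median(p1 + p2), median([median(p1), median(p2)]))   # 4 4.5
```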
![Slide 10](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/10.jpg)
Multiway Stream Aggregation
- Multiple streams of incoming tuples
- Windows:
  - Time-based windows
  - Count-based windows
  - Sliding windows vs. tumbling windows
- 4 stages:
  - Add stage: fetch tuples from each input stream.
  - Merge stage: merge and sort the fetched tuples according to their timestamps.
  - Update stage: update the state of the windows a tuple contributes to.
  - Output stage: forward output tuples to the next aggregation stage.
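The window types listed above differ only in how a tuple is mapped to windows. A minimal sketch, assuming integer timestamps, half-open time windows [s, s + size), and windows starting at a chosen origin; `time_windows` and `count_windows` are hypothetical helpers. With a sliding window a tuple belongs to several windows, which is why the update stage may touch more than one window per tuple.

```python
from typing import List


def time_windows(ts: int, size: int, slide: int, first: int = 0) -> List[int]:
    """Start times of the time-based windows [s, s + size) that contain ts.

    slide == size gives tumbling (non-overlapping) windows;
    slide < size gives sliding (overlapping) windows.
    """
    return [s for s in range(first, ts + 1, slide) if ts < s + size]


def count_windows(position: int, size: int, slide: int) -> List[int]:
    """Ids of the count-based windows containing the tuple at this stream position.

    Window w covers the tuples at positions w * slide .. w * slide + size - 1.
    """
    return [w for w in range(position // slide + 1) if position < w * slide + size]


print(time_windows(5, size=4, slide=4))    # tumbling: [4]
print(time_windows(5, size=4, slide=2))    # sliding:  [2, 4]
print(count_windows(5, size=3, slide=1))   # [3, 4, 5]
```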
![Slide 11](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/11.jpg)
Multiway Stream Aggregation
Figure: four input queues of timestamped tuples.
![Slide 12](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/12.jpg)
Multiway Stream Aggregation
Figure: add stage; tuples are fetched from the input queues.
![Slide 13](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/13.jpg)
Multiway Stream Aggregation
Figure: add and sort stages; the fetched tuples are merged and sorted by timestamp.
![Slide 14](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/14.jpg)
Multiway Stream Aggregation
Figure: add, sort, and update stages; the sorted tuples update the state of the overlapping windows 1-4, 3-6, and 5-8.
![Slide 15](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/15.jpg)
Multiway Stream Aggregation
- Input: raw data converted to tuples and stored in queues.
- Output: a flow of tuples with the aggregated values.
Figure: add, sort, update, and output stages; aggregated window values leave as output tuples.
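The add, merge/sort, update, and output stages walked through in these figures can be sketched sequentially in Python. Several assumptions are not from the slides: each queue holds (timestamp, value) pairs already ordered by timestamp, the aggregate is a sum, windows have size 4 and advance by 2 starting at 1 (chosen to match the 1-4, 3-6, 5-8 labels), and a window is output once every queue has moved past its end (a watermark rule). `aggregate` and `window_starts` are hypothetical names.

```python
import heapq
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

SIZE, SLIDE, FIRST = 4, 2, 1   # hypothetical windows 1-4, 3-6, 5-8, ... as in the figure


def window_starts(ts: int) -> List[int]:
    """Starts of every window [s, s + SIZE - 1] that contains timestamp ts."""
    return [s for s in range(FIRST, ts + 1, SLIDE) if ts <= s + SIZE - 1]


def aggregate(queues: List[Deque[Tuple[int, int]]], batch: int = 2) -> List[Tuple[int, int, int]]:
    """Sum values per sliding window over several timestamp-ordered input queues."""
    windows: Dict[int, int] = {}                   # update-stage state: start -> sum
    last_seen: List[Optional[int]] = [None] * len(queues)
    out: List[Tuple[int, int, int]] = []
    while any(queues):
        # Add stage: fetch up to `batch` tuples from the head of each queue.
        fetched = []
        for i, q in enumerate(queues):
            run = [q.popleft() for _ in range(min(batch, len(q)))]
            if run:
                last_seen[i] = run[-1][0]
            fetched.append(run)
        # Merge stage: every run is already ordered, so a k-way merge sorts them.
        for ts, value in heapq.merge(*fetched, key=lambda t: t[0]):
            # Update stage: add the tuple to each window it contributes to.
            for s in window_starts(ts):
                windows[s] = windows.get(s, 0) + value
        # Output stage: a window is final once every queue has moved past its end.
        watermark = min(t for t in last_seen if t is not None)
        for s in sorted(windows):
            if s + SIZE - 1 < watermark:
                out.append((s, s + SIZE - 1, windows.pop(s)))
    # End of input: flush whatever is still open.
    out.extend((s, s + SIZE - 1, windows[s]) for s in sorted(windows))
    return out


queues = [deque([(1, 10), (4, 5), (7, 1)]),     # (timestamp, value) pairs
          deque([(2, 1), (3, 2), (6, 7)]),
          deque([(5, 3)])]
print(aggregate(queues))   # [(1, 4, 18), (3, 6, 17), (5, 8, 11), (7, 10, 1)]
```

An exhausted queue pins the watermark at its last timestamp, which is conservative: windows are only delayed, never emitted early. The final flush stands in for an end-of-stream signal.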
![Slide 16](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/16.jpg)
Why low-power embedded systems?
- Salient characteristics:
  - Heavy reliance on data transfers
  - Relatively low computation per byte
  - Relatively small amounts of data at a time
- Modern multi-/many-core embedded systems:
  - Low-latency programmable local storage vs. caches
  - High-bandwidth access to main memory
  - VPU and ILP enabled
  - Ultra-low power
- Considerations: communication vs. computation costs, memory access patterns, and granularity of data accesses.
![Slide 17](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/17.jpg)
Data Streaming on embedded systems
Design challenges!
- How stream aggregation can be mapped onto different parallel architectures is still an open problem.
- The potential of such low-power processors for high-end computations.
- Can high-performance computing techniques be deployed on these processors?
- Addressing hardware constraints.
- Understanding the memory access patterns of the algorithms in relation to the computation.
![Slide 18](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/18.jpg)
Parallel Stream Aggregation
Concurrent data structures:
- Used between the different stages of the aggregation process for communication
- Share data across different threads/processes
- Allow for data parallelism
- Load balancing of the workload
- Tuples from each input stream are placed in queues by multiple threads.
- A consumer thread performs the merge, update, and output stages.
- One final aggregator is used.
How? Synchronization.
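The producer/consumer arrangement above can be sketched with Python threads. This assumes one `queue.Queue` per input stream, tuples of the form (timestamp, key) ordered by timestamp within each stream, a hypothetical end-of-stream marker `END`, and a per-key count as the update stage; the output stage is left out. Because the consumer always takes the smallest timestamp among the heads of all live streams, breaking ties by stream index, the merged order is the same however the producer threads interleave.

```python
import heapq
import queue
import threading
from typing import Dict, List, Tuple

END = None  # hypothetical end-of-stream marker


def producer(q: "queue.Queue", tuples: List[Tuple[int, str]]) -> None:
    """Producer thread: place one input stream's tuples into its own queue."""
    for t in tuples:
        q.put(t)
    q.put(END)


def consumer(queues: List["queue.Queue"], merged: List[Tuple[int, str]],
             counts: Dict[str, int]) -> None:
    """Single consumer thread: merge stage (by timestamp) plus update stage."""
    heap = []
    for i, q in enumerate(queues):
        head = q.get()                  # blocks until producer i delivers
        if head is not END:
            heapq.heappush(heap, (head[0], i, head))
    while heap:
        _, i, (ts, key) = heapq.heappop(heap)
        merged.append((ts, key))
        counts[key] = counts.get(key, 0) + 1     # update stage: per-key count
        head = queues[i].get()
        if head is not END:
            heapq.heappush(heap, (head[0], i, head))


streams = [[(1, "a"), (4, "b")], [(2, "b"), (3, "a")], [(5, "a")]]
queues = [queue.Queue() for _ in streams]
merged: List[Tuple[int, str]] = []
counts: Dict[str, int] = {}
threads = [threading.Thread(target=producer, args=(q, s)) for q, s in zip(queues, streams)]
threads.append(threading.Thread(target=consumer, args=(queues, merged, counts)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(merged)   # [(1, 'a'), (2, 'b'), (3, 'a'), (4, 'b'), (5, 'a')]
print(counts)   # {'a': 3, 'b': 2}
```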
![Slide 19](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/19.jpg)
Concurrent data structures: Synchronization Techniques
- Coarse-grained locking
  - Easy but slow...
- Fine-grained locking
  - Fast/scalable, but error-prone, not composable, and prone to deadlocks
- Non-blocking
  - Based on atomic hardware primitives (e.g., TAS, CAS)
  - Good progress guarantees (lock-freedom/wait-freedom)
  - Scalable
Figure credit: Yiannis Nikolakopoulos.
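Non-blocking structures are typically built around a read, compute, compare-and-swap, retry loop. A minimal sketch of that pattern on a shared counter follows. CPython exposes no user-level CAS instruction, so `AtomicInt.compare_and_swap` is emulated with an internal lock. The retry loop is the point of the example, and the emulated version is not itself lock-free. `AtomicInt`, `increment`, and `worker` are hypothetical names.

```python
import threading


class AtomicInt:
    """Integer cell with load and compare-and-swap.

    CPython has no user-level CAS instruction, so an internal lock makes
    compare_and_swap atomic here; on the target hardware this would be a
    single atomic primitive such as CAS.
    """

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._guard = threading.Lock()

    def load(self) -> int:
        return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def increment(counter: AtomicInt) -> int:
    """Non-blocking-style update: read, compute, try to publish, retry on conflict."""
    while True:
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return old + 1


def worker(counter: AtomicInt, n: int) -> None:
    for _ in range(n):
        increment(counter)


counter = AtomicInt()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())  # 4000
```

A failed CAS means another thread's CAS succeeded in the meantime. That is why, with a real hardware CAS, the system as a whole always makes progress (lock-freedom), even though an individual thread may retry.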
![Slide 20](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/20.jpg)
Concurrent data structures: Queue Buffers
- Single-Producer Single-Consumer (SPSC)
  - Lamport 1983: Lamport Queue
  - Giacomoni et al. 2008: FastForward Queue
  - Lee et al. 2009: MCRingBuffer
  - Preud'homme et al. 2010: BatchQueue
- Multi-Producer Multi-Consumer (MPMC)
  - Michael & Scott 1996: MS-Queue (1-lock, 2-lock)
  - Yang & Mellor-Crummey 2016: Fetch-and-Add Queue
  - Message-passing-based queues
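The SPSC queues listed above rely on each index being written by only one side. A minimal Python sketch of a bounded ring buffer in the style of Lamport's queue follows. It is not the code from the cited paper, and `LamportQueue` and `transfer` are hypothetical names. Correctness relies on each write becoming visible in program order, which CPython's GIL provides. A C version on a multicore would need atomics or memory barriers.

```python
import threading
import time
from typing import Any, List, Tuple


class LamportQueue:
    """Bounded single-producer/single-consumer ring buffer (Lamport-style).

    Only the producer writes `_tail` and only the consumer writes `_head`,
    so no lock is needed.
    """

    def __init__(self, capacity: int) -> None:
        self._buf: List[Any] = [None] * (capacity + 1)  # one slot kept empty
        self._head = 0   # next slot to read  (written by the consumer only)
        self._tail = 0   # next slot to write (written by the producer only)

    def try_enqueue(self, item: Any) -> bool:
        nxt = (self._tail + 1) % len(self._buf)
        if nxt == self._head:              # full
            return False
        self._buf[self._tail] = item       # write the slot first ...
        self._tail = nxt                   # ... then publish it
        return True

    def try_dequeue(self) -> Tuple[bool, Any]:
        if self._head == self._tail:       # empty
            return False, None
        item = self._buf[self._head]
        self._head = (self._head + 1) % len(self._buf)
        return True, item


def transfer(n: int = 1000, capacity: int = 16) -> List[int]:
    """Run one producer thread and one consumer thread over the queue."""
    q = LamportQueue(capacity)
    received: List[int] = []

    def produce() -> None:
        for i in range(n):
            while not q.try_enqueue(i):
                time.sleep(0)              # queue full: yield and retry

    def consume() -> None:
        while len(received) < n:
            ok, item = q.try_dequeue()
            if ok:
                received.append(item)
            else:
                time.sleep(0)              # queue empty: yield and retry

    threads = [threading.Thread(target=produce), threading.Thread(target=consume)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(transfer()[:5])  # [0, 1, 2, 3, 4]
```

Leaving one slot empty lets the queue tell "full" from "empty" using only the two indices, so neither side ever has to write the other's variable.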
![Slide 21](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/21.jpg)
Target Architecture
Figure: block diagrams of Myriad1 and Myriad2, showing the LEON RISC cores with their caches, the SHAVE VLIW vector processors (8 on Myriad1, 12 on Myriad2) with their register files and functional units, the CMX SRAM (1 MB and 2 MB), the shared SHAVE L2 cache (128 KB and 256 KB), the DDR controller (128/256 MB LPDDR2/3 stacked die on Myriad2), and the AXI/AHB/APB buses.
Myriad1 architecture highlights:
- 65 nm ultra-low-power architecture (≤ 0.35 W @ 180 MHz) with 11 power islands.
- Hardware support for SIMD, matrix transpose, sparse data, sqrt@fp16, predicated execution, ...
- Heterogeneous SoC: 1 LEON3 (fp64) + 8 SHAVEs (fp32).
- 32 KB LRAM, 1 MB CMX, 16/64 MB DDR, DMAs.
- Power efficiency of 1 TOPS/W (max, 8-bit equivalent).
- FIFO buffers.
Myriad2 architecture highlights:
- 28 nm ultra-low power (≤ 0.5 W @ 600 MHz) with 17 power islands.
- Extended hardware support over Myriad1: clock gating, hard-wired configurable accelerators for imaging and vision, etc.
- Heterogeneous SoC: 2 LEON4 (fp64) + 12 SHAVEs (fp32).
- 256+32 KB LRAM, 2 MB CMX, DDR3 support, DMAs.
- Power efficiency of 2 TOPS/W (max, 16-bit equivalent).
- FIFO buffers.
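Because one watt is one joule per second, the power-efficiency figures convert directly into energy per operation. A worked conversion of the quoted peak ("max") values follows. These are the slide's headline numbers, not measurements from this study.

$$
1\ \mathrm{TOPS/W} = \frac{10^{12}\ \mathrm{op/s}}{1\ \mathrm{J/s}} = 10^{12}\ \mathrm{op/J}
\;\Longrightarrow\; 1\ \mathrm{pJ/op}\ \text{(Myriad1)},
\qquad
2\ \mathrm{TOPS/W} = 2\times 10^{12}\ \mathrm{op/J}
\;\Longrightarrow\; 0.5\ \mathrm{pJ/op}\ \text{(Myriad2)}.
$$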
![Slide 22](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/22.jpg)
Evaluation of Streaming Aggregation Operator in Low Power Embedded Systems
Single Producer Variation
- One producer feeding tuples to ten aggregators in a round-robin fashion
- Three producers feeding tuples to 8 aggregators
- One final aggregator used
- All processes run on SHAVEs
- Queues placed in the CMX slices of the aggregators
![Slide 23](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/23.jpg)
Single Producer Variation
![Slide 24](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/24.jpg)
Three-Producer Variation
- All processes run on SHAVEs
- Queues placed in the CMX slices of the aggregators
- Three producers feeding tuples to 8 aggregators
- One final aggregator used
![Slide 25](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/25.jpg)
Three-Producer Variation
![Slide 26](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/26.jpg)
Streaming Aggregation Operator Customization in Embedded Systems
Streaming aggregation design space
- Category A: decision trees that refer to memory configuration and allocation
- Category B: decision trees related to data movement and to the means by which accesses to shared resources are synchronized
![Slide 27](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/27.jpg)
Methodology
- Input: application constraints and hardware constraints.
- STEP 1: Design space exploration. Remove non-applicable options from the design space, then explore all customized streaming aggregation implementations.
- Step 1 output: throughput, latency, energy, and scalability for each customized implementation.
- STEP 2: Identification of Pareto-efficient implementations (throughput vs. memory size, latency vs. energy consumption, scalability).
- Methodology output: customized streaming aggregation implementation.
Example
- Implementations evaluated: Q160: A1(loc), A2(loc), …, B4(b.w.); Q160: A1(loc), A2(loc), …, B4(p.s.); Q320: A1(loc), A2(loc), …, B4(b.w.); Q320: A1(loc), A2(loc), …, B4(p.s.); Q640: A1(loc), A2(loc), …, B4(p.s.); ...
- Table of Metric1 and Metric2 per implementation (Metric1 values shown: 7.4164, 7.4159, 7.3562, 7.3567, 7.3365, 7.3336, ...).
- Plot: Metric1 vs. Metric2, with points P1, P2, P3, and P4 marked.
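The two methodology steps can be sketched as a constraint filter followed by a Pareto-dominance test. This is a minimal illustration, not the authors' tool. It assumes each implementation is scored by (throughput, -energy), so that larger is better in both metrics, and that the only hardware constraint is a memory budget. The names `prune`, `dominates`, and `pareto_front`, the implementation labels, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Metrics = Tuple[float, float]   # (throughput, -energy): larger is better in both


def prune(candidates: Dict[str, Metrics], memory_kb: Dict[str, int],
          budget_kb: int) -> Dict[str, Metrics]:
    """Step 1 (sketch): drop implementations that violate a hardware constraint."""
    return {name: m for name, m in candidates.items() if memory_kb[name] <= budget_kb}


def dominates(a: Metrics, b: Metrics) -> bool:
    """a dominates b if it is no worse in every metric and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Metrics]) -> List[str]:
    """Step 2 (sketch): keep the implementations no other implementation dominates."""
    return [name for name, m in candidates.items()
            if not any(dominates(other, m)
                       for other_name, other in candidates.items() if other_name != name)]


# Hypothetical numbers for illustration only, not results from the study.
candidates = {"impl_a": (100.0, -8.0), "impl_b": (120.0, -9.0),
              "impl_c": (90.0, -9.5), "impl_d": (130.0, -12.0)}
memory_kb = {"impl_a": 160, "impl_b": 320, "impl_c": 160, "impl_d": 640}

print(pareto_front(candidates))                          # ['impl_a', 'impl_b', 'impl_d']
print(pareto_front(prune(candidates, memory_kb, 400)))   # ['impl_a', 'impl_b']
```

The pairwise check is quadratic in the number of implementations. That is acceptable for a design space already narrowed by step 1.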
![Slide 28](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/28.jpg)
Evaluation setup
- Dataset: SoundCloud (user id, timestamp, song id, comment)
- Query: the user id with the highest number of comments
- Platforms: Myriad1 (8 cores), Myriad2 (12 cores)
- Evaluation metrics: throughput, memory size, latency, energy consumption
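A sequential sketch of the evaluation query on the single-producer topology: one producer deals records to the aggregators, each aggregator keeps partial per-user comment counts, and one final aggregator merges them. The sketch assumes round-robin dispatch and ignores SHAVE placement, CMX queues, and windowing. `top_commenter` and the sample records are hypothetical. Because counting is distributive, the answer does not depend on the number of aggregators.

```python
from collections import Counter
from typing import Iterable, Tuple

Record = Tuple[int, int, int, str]   # (user_id, timestamp, song_id, comment)


def top_commenter(records: Iterable[Record], n_aggregators: int) -> Tuple[int, int]:
    """Single-producer topology (sketch) for the query "user id with most comments"."""
    # Producer: deal records to the aggregators in round-robin order.
    partials = [Counter() for _ in range(n_aggregators)]
    for i, (user_id, _timestamp, _song_id, _comment) in enumerate(records):
        # Each aggregator keeps a partial per-user comment count.
        partials[i % n_aggregators][user_id] += 1
    # Final aggregator: counts are distributive, so partial counts simply add up.
    final = Counter()
    for p in partials:
        final.update(p)
    user_id, comments = final.most_common(1)[0]
    return user_id, comments


# Hypothetical records shaped like the dataset's (user id, timestamp, song id, comment).
records = [(7, 1, 100, "nice"), (3, 2, 100, "great"), (7, 3, 101, "again"),
           (5, 4, 102, "ok"), (7, 5, 100, "wow"), (3, 6, 103, "hmm")]
print(top_commenter(records, n_aggregators=10))   # (7, 3)
```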
![Slide 29](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/29.jpg)
Multiway streaming aggregation results: throughput, latency, energy, and memory
![Slide 30](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/30.jpg)
Performance per watt

| Platform | Latency (µs) | Throughput (tuples/s) | Throughput per watt ((tuples/s)/W) |
|---|---|---|---|
| Myriad1 | 140.38 | 123,622 | 379,041 |
| Myriad2 | 39.8 | 497,154 | 1,004,766 |
| Intel Xeon E5 | 15 | 1,105,221 | 18,412 |

- About 20× higher performance per watt on Myriad1 than on the Intel Xeon E5
- About 54× higher performance per watt on Myriad2 than on the Intel Xeon E5
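The table's columns are linked by simple arithmetic: throughput divided by throughput per watt gives the average power each platform drew during the run. The ratio of the per-watt figures gives the ×20 and ×54 claims. The sketch below only recomputes these derived quantities from the table, assuming "t" means tuples. `TABLE`, `implied_power_w`, and `perf_per_watt_ratio` are hypothetical names.

```python
# Values copied from the performance-per-watt table: (throughput t/s, (t/s)/W).
TABLE = {
    "Myriad1": (123_622, 379_041),
    "Myriad2": (497_154, 1_004_766),
    "Intel Xeon E5": (1_105_221, 18_412),
}


def implied_power_w(name: str) -> float:
    """Average power implied by the table: throughput / (throughput per watt)."""
    tps, tps_per_w = TABLE[name]
    return tps / tps_per_w


def perf_per_watt_ratio(name: str, baseline: str = "Intel Xeon E5") -> float:
    """How many times more tuples per joule `name` processes than the baseline."""
    return TABLE[name][1] / TABLE[baseline][1]


for name in TABLE:
    print(f"{name}: {implied_power_w(name):.2f} W, "
          f"{perf_per_watt_ratio(name):.1f}x the Xeon's performance per watt")
```

This gives roughly 0.33 W for Myriad1, 0.49 W for Myriad2, and 60 W for the Xeon. The ratios come out at about 20.6 and 54.6, consistent with the ×20 and ×54 above and within the ≤ 0.35 W and ≤ 0.5 W figures quoted for the two Myriad chips.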
![Slide 31](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/31.jpg)
Conclusions
- Designed efficient concurrent data structure implementations for embedded system applications.
- Evaluated a concurrent data structure implementation model based on message passing.
- Explored the design space of streaming aggregation implementations on embedded architectures.
- Data streaming: a major departure from the traditional persistent database paradigm.
- Fundamental rethinking of models, assumptions, algorithms, system architectures, ...
![Slide 32](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/32.jpg)
![Slide 33](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/33.jpg)
References
1. Lamport, L.: Specifying Concurrent Program Modules. ACM Transactions on Programming Languages and Systems 5 (1983) 190–222
2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), ACM (2008) 43–52
3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: Fast and Memory-Thrifty Core to Core Communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (2010) 215–222
4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A Lock-Free, Cache-Efficient Shared Ring Buffer for Multi-Core Architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS '09), ACM, New York, NY, USA (2009) 78–79
5. Michael, M., Scott, M.: Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC), ACM (1996) 267–275
6. Yang, C., Mellor-Crummey, J.: A Wait-free Queue as Fast as Fetch-and-Add. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), ACM (2016) 16:1–16:13
![Slide 34](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/34.jpg)
1. Data are fetched from off-chip DDR into CMX using DMA (the L2 cache is enabled in normal mode).
2. Tuples are copied from one CMX slice to another using memcpy or DMA.
3. 4-byte pointers are transferred continuously from one SHAVE to another to determine which windows should be removed.
![Slide 35](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/35.jpg)
Three-Producer Variation
![Slide 36](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/36.jpg)
Queues
Figure: throughput (Mops/s) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12) for the 1-lock, 2-lock, and FAAQueue queues.
![Slide 37](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/37.jpg)
Queues
Figure: power (mW) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12) for the 1-lock, 2-lock, and FAAQueue queues.
![Slide 38](https://reader034.vdocument.in/reader034/viewer/2022051804/5ff0f6f896458c240f2cd085/html5/thumbnails/38.jpg)
Queues
Figure: energy per operation (mJ/op) vs. number of SHAVEs (1, 2, 4, 6, 8, 10, 12) for the 1-lock, 2-lock, and FAAQueue queues.
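The 1-lock and 2-lock curves refer to the lock-based MS-Queue variants. A minimal Python sketch of their structure follows, after Michael and Scott's blocking algorithm. A dummy node keeps head and tail apart, so with two locks an enqueuer and a dequeuer never wait for each other, while `single_lock=True` shares one lock between both ends. This is not the code measured on the SHAVEs, and the FAA queue is not sketched. Under CPython's GIL the example shows the algorithm's structure rather than any speedup. `LockedQueue` and `stress` are hypothetical names.

```python
import threading
from typing import Any, List, Optional, Tuple


class _Node:
    __slots__ = ("value", "next")

    def __init__(self, value: Any = None) -> None:
        self.value = value
        self.next: Optional["_Node"] = None


class LockedQueue:
    """Linked FIFO queue after Michael & Scott's blocking (lock-based) algorithm."""

    def __init__(self, single_lock: bool = False) -> None:
        dummy = _Node()
        self._head = dummy                 # always points at the dummy node
        self._tail = dummy
        self._head_lock = threading.Lock()
        # 1-lock variant: both ends share one lock; 2-lock: separate locks.
        self._tail_lock = self._head_lock if single_lock else threading.Lock()

    def enqueue(self, value: Any) -> None:
        node = _Node(value)
        with self._tail_lock:
            self._tail.next = node         # link after the current last node
            self._tail = node

    def dequeue(self) -> Tuple[bool, Any]:
        with self._head_lock:
            first = self._head.next
            if first is None:              # only the dummy is left: empty
                return False, None
            value = first.value
            first.value = None             # first becomes the new dummy
            self._head = first
            return True, value


def stress(q: LockedQueue, producers: int = 3, items: int = 1000) -> List[Tuple[int, int]]:
    """Enqueue from several threads, then drain the queue."""
    def produce(pid: int) -> None:
        for i in range(items):
            q.enqueue((pid, i))

    threads = [threading.Thread(target=produce, args=(p,)) for p in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    drained: List[Tuple[int, int]] = []
    while True:
        ok, item = q.dequeue()
        if not ok:
            return drained
        drained.append(item)


out = stress(LockedQueue())
print(len(out))  # 3000
```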