Two routes to specialisation: Loki and lowRISC Robert Mullins, University of Cambridge WEEE 10-12 September 2015 Espoo, Finland


Two routes to specialisation: Loki and lowRISC

Robert Mullins, University of Cambridge

WEEE 10-12 September 2015

Espoo, Finland


Specialisation

• More transistors but end of Dennard scaling
  • Dark silicon, utilisation wall etc.

• Specialisation is an answer, but not without problems


• Some possible directions
  • Many heterogeneous SoCs
    • Tackle complexity with open-source? (lowRISC)
    • Explore how to make SoC designs more flexible (target broader markets)
  • Homogeneous sea of resources
    • FPGA -> CGRA -> manycore/MPPA (Loki)
    • Specialise “software” for each application


Loki

• Simple tiled many-core processor
  • 8 cores per tile + 64KB SRAM, 1mm^2 @ 40nm
  • Each core is a complete 32-bit processor
  • 5-25pJ/op @ 40nm (<2W for 128 cores)

• Message-passing support at ISA level
  • Every instruction can send its result to a remote location on chip
  • Register-mapped FIFOs
  • Fast multicast support within a tile
  • No cache coherency support between tiles at present (can share data via L2)

• Configurable on-chip memory system
  • Each tile contains SRAM that may be dedicated as scratchpad, L1 or L2 cache
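The energy and power figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes that every core completes one operation per cycle and that the chip runs at 450 MHz (the clock quoted with the AES results later in the talk); the one-op-per-cycle rate and the function name `chip_power_watts` are assumptions made for illustration.

```c
#include <assert.h>

/* Estimated chip power in watts: (cores * ops per second per core) * joules per op.
 * Assumes each core completes one operation every cycle, which is an
 * illustrative assumption and not a figure taken from the slides. */
double chip_power_watts(int cores, double clock_hz, double picojoules_per_op)
{
    assert(cores > 0 && clock_hz > 0.0 && picojoules_per_op >= 0.0);
    double ops_per_second = (double)cores * clock_hz;
    return ops_per_second * picojoules_per_op * 1e-12; /* pJ -> J */
}
```

Under these assumptions, 128 cores at 25 pJ/op come to roughly 1.44 W and at 5 pJ/op to roughly 0.29 W, which is consistent with the "<2W for 128 cores" bound.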


Loki

[Figure labels: 64KB, 8 cores, local interconnects, inter-tile routers]

• Chip-wide networks:
  1. L1$ to L2$ requests
  2. L2$ to memory requests
  3. Core to core data
  4. Mem/L2$ responses
  5. Credit network


A Loki tile


Sequential consistency is retained within a tile as operations arrive at each bank in the order they entered the network (crossbar) [see Zhang PDCN’05]


Loki’s memory system

• Each bank can service a miss and offers hit-under-miss support

• Synchronization/atomics
  • Load-and-OP (AND, OR, XOR, ADD), Exchange
  • LL/SC

• Can access command set at memory banks (sendconfig instruction)
  • Send cache line to another bank
  • Flush, invalidate or prefetch cache lines
  • Bypass L1/L2
  • Memset cache line

• Same mechanism can be used to form packets on core-to-core network
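For readers unfamiliar with the synchronisation primitives listed on this slide, here is a minimal sketch of their semantics using C11 `<stdatomic.h>`. It is not Loki code: the function names are hypothetical, and the compare-and-swap retry loop is only the portable C stand-in for what an LL/SC pair would do on hardware that provides one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Load-and-ADD: return the old contents and add delta in one atomic step. */
uint32_t load_and_add(_Atomic uint32_t *word, uint32_t delta)
{
    assert(word != 0);
    return atomic_fetch_add(word, delta);
}

/* Exchange: store new_value and return the previous contents. */
uint32_t exchange(_Atomic uint32_t *word, uint32_t new_value)
{
    assert(word != 0);
    return atomic_exchange(word, new_value);
}

/* Atomic maximum built from a read / compute / conditional-store retry loop,
 * the pattern an LL/SC pair is typically used for. Returns the value that
 * was in memory immediately before this call took effect. */
uint32_t atomic_max(_Atomic uint32_t *word, uint32_t candidate)
{
    assert(word != 0);
    uint32_t seen = atomic_load(word);
    while (seen < candidate &&
           !atomic_compare_exchange_weak(word, &seen, candidate)) {
        /* On failure, seen now holds the current contents; try again. */
    }
    return seen;
}
```

The Load-and-OP family (AND, OR, XOR, ADD) maps onto `atomic_fetch_and`, `atomic_fetch_or`, `atomic_fetch_xor` and `atomic_fetch_add` in the same way.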


Area (approx.)
Cores    ~50%
SRAM     40-45%
Routers  4-6%
Other    ~2-3%


Loki pipeline

• Small custom ISA
  • Incl. support for predicated execution
  • 6 register-mapped network FIFOs (blocking reads)
  • Decoupled loads

• Every instruction can send its result on network
  • Can send instructions too!

• Channel map table
  • Read in decode stage
  • 16-entry table that maps channel names to “network addresses”
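The channel map table can be pictured as a small array indexed by channel name. The sketch below is a software model only: the slides do not give the encoding of a network address, so it is kept as an opaque 32-bit value, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CHANNEL_MAP_ENTRIES 16  /* table size quoted on the slide */

/* Opaque network address; the real encoding is not given in the slides. */
typedef uint32_t network_address_t;

typedef struct {
    network_address_t entry[CHANNEL_MAP_ENTRIES];
} channel_map_t;

/* Bind a channel name (table index) to a destination. */
void channel_map_set(channel_map_t *map, unsigned channel, network_address_t dest)
{
    assert(map != 0 && channel < CHANNEL_MAP_ENTRIES);
    map->entry[channel] = dest;
}

/* Lookup performed when an instruction names an output channel; the slide
 * places this read in the decode stage. */
network_address_t channel_map_lookup(const channel_map_t *map, unsigned channel)
{
    assert(map != 0 && channel < CHANNEL_MAP_ENTRIES);
    return map->entry[channel];
}
```

Keeping the table this small means an instruction only needs a few bits to name its destination, while the full address is resolved by the lookup.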


Example

uint32_t updateCRC32(uint8_t ch, uint32_t crc)
{
    return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
}

setchmapi 1, r15
[...]
fetch r10
xor r11, r14, r13
lli r12, %lo(table)
lui r12, %hi(table)
andi r11, r11, 255
slli r11, r11, 2
addu r11, r12, r11
ldw 0(r11) -> 1
srli r12, r14, 8
xor.eop r11, r2, r12
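The slide shows only the per-byte update step. To run it on a host machine, a table and a driver loop are needed; the sketch below supplies both. It assumes the standard reflected CRC-32 polynomial 0xEDB88320 with the usual initial value and final inversion, none of which is stated on the slide, and the helper names `initCRC32Table` and `crc32_buffer` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t table[256];

/* Build the lookup table for the reflected CRC-32 polynomial 0xEDB88320.
 * The polynomial is an assumption; the slide does not show the table. */
void initCRC32Table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
}

/* Per-byte update, as on the slide. */
uint32_t updateCRC32(uint8_t ch, uint32_t crc)
{
    return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
}

/* CRC of a whole buffer, using the conventional initial value and final XOR. */
uint32_t crc32_buffer(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = updateCRC32(data[i], crc);
    return crc ^ 0xFFFFFFFFu;
}
```

With these conventions, calling `initCRC32Table()` and then `crc32_buffer` on the ASCII string "123456789" yields the well-known check value 0xCBF43926.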


L0 I$ / scratchpad

• Fetch stage contains small (64-instruction), fully associative I$
  • Can skip tag checks with “in buffer jmp”

• Instructions just executed in FIFO order until end of packet (don’t have an actual PC)

• Execute stage contains small (256 word) local scratchpad


Execution patterns (within a tile)

• MIMD

• DLP (SIMD)

• DLP with helper core (scalarization)
  • One core is dedicated to provide common data over multicast bus
  • Enables work done by remaining data-parallel cores to be reduced

• Worker farm

• Task-level pipelines

• Dataflow (single persistent instruction packet per core)
  • Can support a single instruction per core

[See UCAM-CL-TR-846 for full details]


Example: JPEG colour conversion

[Bates13]


DOACROSS loops

• [Campanoni et al. ISCA 2014]

• Substantial speedup available from exploiting DOACROSS parallelism

• 16 in-order cores (“Atom” like)

• Much improved performance with low-latency communication mechanism (“ring cache” RC) for signals and values


Example: ADPCM (encoder)

• We can exploit some DOACROSS parallelism in the case of ADPCM
  • Achieves 2X using 3 cores

• Can do slightly better by simply splitting loop body across two cores
  • Body then fits in core’s L0 I$
  • ~2.5X on 2 cores

• Plan to explore simplified HELIX implementation for our Loki LLVM port
  • Fast signals and shared L1 should make Loki a good target


ILP Splitting

• Another approach to grouping/fusing cores

• LLVM pass automatically splits a program across N cores in a tile using available ILP; values are communicated over the tile's local core-to-core network

• Early results:
  • Stencil2D (MachSuite) – 1.78X (3 cores)
  • Gemm/Blocked (MachSuite) – 1.75X (3 cores)
  • Matrix Multiply (2 cores)
    • Initial attempt – 0.72X
    • With use of “restrict” – 1.41X
    • Exploit commutativity – 1.86X

• Currently exploring optimisations that consider the order in which basic blocks are visited, and some microarchitectural enhancements (input FIFO issues)
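The jump in the matrix multiply result when “restrict” is used is easier to see with the qualifier in context. The sketch below is a plain row-major matrix multiply written for illustration; it is not the benchmark code behind the quoted numbers, and the explanation in the comment is the usual reason `restrict` helps a compiler rather than a statement about the pass itself.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for n x n row-major matrices.
 * restrict promises that a, b and c never overlap, so a store to c cannot
 * change a later load from a or b. Without that promise the compiler must
 * keep loads and stores in program order, which leaves little independent
 * work to hand to a second core. */
void matmul(size_t n, const float *restrict a, const float *restrict b,
            float *restrict c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Exploiting commutativity goes one step further by allowing the additions into `sum` to be reordered, for example by accumulating partial sums on different cores and combining them at the end.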

[Alex Bradbury]


AES case study

• AES-128-CTR mode

• 2 days’ work for a recent graduate

• Want to avoid running same code on each core:
  • Would have poor L0 I$ performance
  • Cores would produce less regular memory accesses

• Instead, the AES code is mapped as a task pipeline

• Loki results
  • 5.1 cycles/byte on one tile
  • 2.5 cycles/byte on two tiles
  • 11.5 Gbps at 450MHz for 128 cores

• Comparison: ARM + NEON
  • Bitsliced implementation
  • Lower bound is 13 cycles/byte
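The 128-core throughput figure can be reproduced from the cycles/byte results. The sketch below assumes that the 128 cores (16 tiles of 8 cores) are used as eight independent two-tile pipelines and that throughput scales linearly across them; both points are a reading of the slide rather than something it states, and the function name is hypothetical.

```c
#include <assert.h>

/* Aggregate throughput in Gbit/s for `groups` independent pipelines, each
 * processing one byte every cycles_per_byte cycles.
 * Assumes perfectly linear scaling across groups. */
double aes_throughput_gbps(int groups, double clock_hz, double cycles_per_byte)
{
    assert(groups > 0 && clock_hz > 0.0 && cycles_per_byte > 0.0);
    double bytes_per_second = (double)groups * clock_hz / cycles_per_byte;
    return bytes_per_second * 8.0 / 1e9; /* bytes/s -> Gbit/s */
}
```

Eight two-tile groups at 2.5 cycles/byte and 450 MHz give about 11.5 Gbit/s; sixteen single-tile groups at 5.1 cycles/byte give about 11.3 Gbit/s.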


AES example: single tile (8-core) mapping

• Cores 1-5 address banks 2-5 using 4 separate channels to save 1 or 2 address manipulation instructions in the loop body


Current status

• Loki LLVM compiler implementation

• ISS + complete SystemC model

• SystemVerilog implementation is complete (< 30K LOC)
  • Generates 128-core ASIC version and 32-core FPGA implementation

• Test infrastructure, including random program generator

• Promising single-tile and multi-tile results

• Will tape out very soon!
  • 4mm x 4mm die, TSMC 40nm (128 cores, 1MB on-chip cache)
  • Off-chip I/O to FPGA “Northbridge”
    • 4 x 13-bit length-matched full-duplex source-synchronous DDR channels


Development boards

• Dev. boards will be available next year.

• Package (352-ball BGA) and board from Michael Taylor’s group at UCSD
  • See http://bjump.org

• Community
  • Aim to distribute boards to research groups or provide remote access
  • Support research in compilers, mapping, applications etc.


“Subject: Redo BBC Micro” (2008)


“Subject: Development of an open-source SoC” (2014)

• “Create an open-source SoC capable of running Linux well”

• Make it real to encourage contributions and grow community
  • Volume silicon manufacture
  • Ability to purchase in small quantities
  • Low-cost development board
  • Regular updates to SoC
  • Events, training and documentation

• lowRISC C.I.C (Not-for-profit company)


Why create an open source SoC?

• Research and teaching

• Serve the open-source community

• Demand from industry
  • Remove constraints on use of processor IP
  • Use lots of cores freely to provide flexible implementation
  • Lower costs – create proven base for derivatives

• Why now?


Approach to design

• Aim for simplicity
  • No backwards compatibility issues, no baggage, clean-sheet design

• Think about security from the start

• Free from commercial influences and release cycles
  • Cores are free and customisable (one ISA)

• Aim to maximise functionality and flexibility (no trade-offs to create product range)


RISC-V

• RISC-V ISA from UC Berkeley

– Aim to create open ISA standard for industry

– Explicitly designed to be extensible

– Simple base integer ISA (~40 instructions)

– 32-bit, 64-bit, 128-bit (!) variants

• Rocket SoC: cores, L1, L2 cache, interconnect

• Silicon proven (45nm and 28nm)

• Chisel (open-source HW construction language)


lowRISC SoC


Current status


General-purpose tagged memory

• Prevent control-flow hijacking attacks

• Accelerate debug tools

• use-after-free detection

• Per-word locks, full/empty bits for synchronization

• Control-flow integrity

• Assist Garbage collection

• Dynamic information flow tracking (DIFT)

• Capabilities

• Transactional memory

• Provenance tracking

• ……


General-purpose tagged memory

• LLVM pass has been implemented to tag “sensitive” pointers
  • i.e. code pointers, virtual function table pointers, function pointers, ...

• Every load of a sensitive pointer is replaced with a load that expects a particular tag to be read; if the tag does not match, an exception is raised

• Prevents classic buffer overflow attacks and return-oriented programming

• Some other related attacks may remain if code has the right/wrong! bugs

• Overheads and future work
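The checked-load idea on this slide can be modelled in software. The sketch below is an illustration only, not the lowRISC ISA or the LLVM pass: the tag values, type and function names are hypothetical, the hardware exception is modelled as a `false` return value, and the rule that an ordinary store clears the tag is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TAG_NONE     0u
#define TAG_CODE_PTR 1u  /* hypothetical tag value for sensitive pointers */

/* One memory word together with its tag bits. */
typedef struct {
    uint64_t value;
    uint8_t  tag;
} tagged_word_t;

/* Ordinary store: writes the data and clears the tag (model assumption), so
 * a buffer overflow that overwrites a sensitive pointer also destroys its tag. */
void store_untagged(tagged_word_t *w, uint64_t value)
{
    assert(w != 0);
    w->value = value;
    w->tag = TAG_NONE;
}

/* Store that a compiler pass would emit when writing a sensitive pointer. */
void store_tagged(tagged_word_t *w, uint64_t value, uint8_t tag)
{
    assert(w != 0);
    w->value = value;
    w->tag = tag;
}

/* Checked load: returns false (standing in for a hardware exception) when
 * the stored tag does not match the expected one; *out is left untouched. */
bool load_checked(const tagged_word_t *w, uint8_t expected_tag, uint64_t *out)
{
    assert(w != 0 && out != 0);
    if (w->tag != expected_tag)
        return false;
    *out = w->value;
    return true;
}
```

In this model a return address written with `store_tagged` loads normally, while the same slot overwritten by `store_untagged` (as an overflowing copy would do) fails the check before the corrupted value can be used as a jump target.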


Minion cores

• Will initially support DMA and programmable I/O
  • Use minions to generate I/O signals, pre-process I/O data etc.

• Would like to also use minions to support tagged memory
  • Particular tags trigger message to minion from application processor
  • Minion executes security policy in parallel with app. processor

• Plan to investigate implementing more of the SoC using minion cores + appropriate “shims”
  • E.g. memory controller

• Will use the “Pulpino” core from Luca Benini’s group at ETHZ


Open source HW

• Smaller community, higher barrier to entry

• Fabricating chips is expensive

• Verification effort is significant
  • Patching typically can't be done in the same way

• Of course, all good reasons to produce an open, known-good SoC design and to promote a community effort


Roadmap

2015
• Create “untethered” version of SoC with tagged memory
• Complete core SoC implementation (no GPU initially)

2016
• First test chips (40 or 28nm)
• 2 to 4 cores, most probably dual-issue
• Integrate 3rd party IP, e.g. mem controller, USB, Ethernet
• Support early adopters in creating derivative designs

2017
• Volume fab. run for community dev. board
• Strengthen lowRISC IP offerings

[Figure label: Third Party Design Starts]


Research in the open

• Have lots of ideas, collaborate and share from day one

• Open development helps to attract best people, even if they contribute remotely (huge amount of good will and enthusiasm for these projects – if people know what you are trying to do!)

• Make it easy for people to get involved, reproduce, extend and improve (this requires significant effort)

• Work with industry

• Provide vehicle to evaluate/implement other research ideas


Find out more and get involved…

• ORCONF 2015
  • October 9-11th, 2015
  • Ideasquare, Geneva

• ORCONF began as an annual event for OpenRISC developers. It is now run as a Free and Open Source Silicon (FOSSi) event.

• lowRISC workshop on Friday

• Talks on RISC-V


Final thoughts

• Exploring two different approaches to achieving energy efficiency through specialisation:
  • Loki: flexible processor array
  • lowRISC: an open source SoC

• Opportunities to collaborate with others on both projects

• More information about lowRISC at www.lowrisc.org
  • See also phab.lowrisc.org

• Sign up to announcement and discussion lists

• Email: [email protected]


Acknowledgements

• Both lowRISC and Loki are team efforts
  • Loki team currently includes Daniel Bates, Alex Bradbury and Alex Chadwick (recent work on DNNs by Chihang Wang and Sam Tarver; earlier work on configurable L1 memory system by Andreas Koltes)

• lowRISC team currently includes Wei Song, Alex Bradbury and numerous external contributors. Contributions on tagged memory and minion core I/O shims by Hongyan Xia and Martin Papadopoulos. Recent work on tagged memory architecture and LLVM support by Lucas Sonnabend and Matthew Toseland

• Loki is funded by an ERC starter grant (GA n. 306386)
  • This work was previously supported by UK EPSRC grant EP/G033110/1

• lowRISC is kindly supported by a private donation and a donation from Google.

• Thank you for listening!