investigating hardware micro-instruction folding in a java...

28
Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion Investigating Hardware Micro-Instruction Folding in a Java Embedded Processor Flavius Gruian 1 Mark Westmijze 2 1 Lund University, Sweden [email protected] 2 University of Twente, The Netherlands [email protected] Java Technologies for Real-time and Embedded Systems, 2010 1 / 17

Upload: others

Post on 23-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Investigating Hardware Micro-Instruction Foldingin a Java Embedded Processor

Flavius Gruian1 Mark Westmijze2

1Lund University, [email protected]

2University of Twente, The [email protected]

Java Technologies for Real-time and Embedded Systems, 2010

1 / 17

Page 2: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Outline

1 Introduction

2 Folding BlueJEP

3 Implementation and Experiments

4 Discussion

5 Conclusion

2 / 17

Page 3: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Goal

What are we trying to do?

Implement bytecode folding on an existing Java embeddedprocessor and evaluate the results with respect to:

theoretical estimates

absolute speed-up

performance w.r.t. device area

Finally...

Is it worth it?

3 / 17

Page 4: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Goal

What are we trying to do?

Implement bytecode folding on an existing Java embeddedprocessor and evaluate the results with respect to:

theoretical estimates

absolute speed-up

performance w.r.t. device area

Finally...

Is it worth it?

3 / 17

Page 5: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Starting Point

Original Processor Architecture

BlueJEP

BlueSpec System Verilog Java Embedded Processor,a redesign of JOP [M. Schoberl]

micro-programmed, stack machine core

predictable rather than high-performance (RT systems)

JOP micro-instruction set (for ease of programming)

specified in BSV [see JTRES 2007]

4 / 17

Page 6: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Starting Point

Original Processor Architecture

BlueJEP

BlueSpec System Verilog Java Embedded Processor,a redesign of JOP [M. Schoberl]

micro-programmed, stack machine core

predictable rather than high-performance (RT systems)

JOP micro-instruction set (for ease of programming)

specified in BSV [see JTRES 2007]

4 / 17

Page 7: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

BlueJEP Architecture

Six Stages Micro-Programmed Pipeline

Fetch Bytecode

Fetch micro-I

Decode& Fetch Register

Fetch Stack Execute Write-

back

micro-ROM

BC2 microA

jump table

bypassforward

BC-Cache

jpc

StackRegisters

bus interface (OPB)

load cache

SP VP

MD MrAMwA

PC

Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6bc

fifo

decfi

fo

fsfif

o

exfif

o

wbfif

o

OPD

constCacheCtl

rollback

MMU access registers

5 / 17

Page 8: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Folding Theory

Bytecode Folding Theory

stack machine (JVM) code can be shorter on multi-addressmachines that emulate them

stack code 3-address code≈7 bytes ≈4 bytes

iload a add a, b, ciload biadd

istore c

folding pattern length depends on the available resources(ALUs, memory ports)

bytecodes are grouped in classes by resource access, e.g.:

P producer: pushes a value in the stackC consumer: pops a value in the stackO operation: uses top two and pushes back a resultS special: not foldable (breaks a pattern)

6 / 17

Page 9: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Folding Theory

Bytecode Folding Theory

stack machine (JVM) code can be shorter on multi-addressmachines that emulate them

stack code 3-address code≈7 bytes ≈4 bytes

iload a add a, b, ciload biadd

istore c

folding pattern length depends on the available resources(ALUs, memory ports)

bytecodes are grouped in classes by resource access, e.g.:

P producer: pushes a value in the stackC consumer: pops a value in the stackO operation: uses top two and pushes back a resultS special: not foldable (breaks a pattern)

6 / 17

Page 10: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Folding Theory

Bytecode Folding Theory

stack machine (JVM) code can be shorter on multi-addressmachines that emulate them

stack code 3-address code≈7 bytes ≈4 bytes

iload a add a, b, ciload biadd

istore c

folding pattern length depends on the available resources(ALUs, memory ports)

bytecodes are grouped in classes by resource access, e.g.:

P producer: pushes a value in the stackC consumer: pops a value in the stackO operation: uses top two and pushes back a resultS special: not foldable (breaks a pattern)

6 / 17

Page 11: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Folding Scheme

Adopted Folding Scheme

fixed folding pattern approach [picoJava-II]

micro-instruction level (rather than bytecode level)

maximum length of four micro-instructions(at most four single instruction bytecodes)

Folding Pattern Length

ppoc 4poc 3ppc 3pc 2oc 2po 2

7 / 17

Page 12: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Folding Scheme

Pre-design Estimates

How much is the number of executed clock cycles reduced?

Processed cycle accurate simulation traces say:

≈ 30% fewer cycles for 0-delay memory

≈ 25% fewer cycles for realistic memory

8 / 17

Page 13: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Folding Scheme

Pre-design Estimates

How much is the number of executed clock cycles reduced?Processed cycle accurate simulation traces say:

≈ 30% fewer cycles for 0-delay memory

≈ 25% fewer cycles for realistic memory

8 / 17

Page 14: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Design

Architectural Changes

Increase fetch parallelism to allow folding:

wider fetch-bytecode stage: up to four bytecodes must beavailable simultaneously.

multiple bytecode FIFOs: to feed the next stage withsequences of bytecodes.

wider fetch-instruction stage: up to four differentmicro-addresses must be read simultaneously.

multiple micro-instruction FIFOs: to provide patterns tothe decode stage.

folding schemes in the decode stage: to identify andhandle foldable patterns.

9 / 17

Page 15: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Design

Configurability

Highly configurable architecture:1 bytecode bandwidth (1,2,4)2 micro-instruction bandwidth (1,2,4)3 foldable patterns

Fetch Bytecode

Fetch micro-I

Decode& Fetch Register

micro-ROMBC2

microA

jump table

Stage 1 Stage 2 Stage 3

bcfif

os

decfi

fos

Figure: Handling 2 bytecodes, 4 micro-instructions simultaneously.

10 / 17

Page 16: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Design

Configurability

Highly configurable architecture:1 bytecode bandwidth (1,2,4)2 micro-instruction bandwidth (1,2,4)3 foldable patterns

Fetch Bytecode

Fetch micro-I

Decode& Fetch Register

micro-ROMBC2

microA

jump table

Stage 1 Stage 2 Stage 3

bcfif

os

decfi

fos

Figure: Handling 2 bytecodes, 4 micro-instructions simultaneously.10 / 17

Page 17: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Setup and Tools

Synthesis → device area, maximum clock frequency

FPGA, Xilinx Virtex-5 (XC5VLX30-3)BSV compiler 2006.11, BSV → VerilogXilinx EDK 9.1i, Verilog + IPs → SystemXilinx ISE 9.1i, System→ FPGAChipscope, to calibrate simulation

Simulation → executed clock cycles

Desktop, LinuxBSV compiler 2006.11, BSV → Executablecustom tools for parsing the output frominstrumented code

11 / 17

Page 18: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Setup and Tools

Synthesis → device area, maximum clock frequency

FPGA, Xilinx Virtex-5 (XC5VLX30-3)BSV compiler 2006.11, BSV → VerilogXilinx EDK 9.1i, Verilog + IPs → SystemXilinx ISE 9.1i, System→ FPGAChipscope, to calibrate simulation

Simulation → executed clock cycles

Desktop, LinuxBSV compiler 2006.11, BSV → Executablecustom tools for parsing the output frominstrumented code

11 / 17

Page 19: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Results

Original vs. Folding Configurations (2,2; 2,4)

0

0.5

1.0

1.5

2.0

2.5

Relative Clk Cycles Relative Clk Frequency Relative Device AreaRelative Performance Relative Performance/Area Unit

12 / 17

Page 20: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Results

Original vs. Folding Configurations (4,4)

0

0.5

1.0

1.5

2.0

2.5Relative Clk Cycles Relative Clk Frequency Relative Device AreaRelative Performance Relative Performance/Area Unit

13 / 17

Page 21: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Discussion

Introducing folding and more patterns:

+ reduce the executed clock cycles (as in theory), but

- . . . greatly reduce the maximum clock frequency

- . . . and also greatly increase the required device area

Performance/area unit gets as low as 1/4 for some designs withmaximal folding!

Introducing more simple processors instead of using folding wouldbe more efficient.

14 / 17

Page 22: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Discussion

Introducing folding and more patterns:

+ reduce the executed clock cycles (as in theory), but

- . . . greatly reduce the maximum clock frequency

- . . . and also greatly increase the required device area

Performance/area unit gets as low as 1/4 for some designs withmaximal folding!

Introducing more simple processors instead of using folding wouldbe more efficient.

14 / 17

Page 23: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Provisions

Reservations:

using RT-level vhdl instead of BSV may offer better controlover the critical path

introducing more stages may increase clock frequency

multi-method caches instead of one-method cache wouldimprove overall performance

other applications than the one we used (GC) could exhibitmore folding potential

more elaborate folding schemes may be more effective

15 / 17

Page 24: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Finally...

Summary We evaluated folding schemes for BlueJEP andconclude that the performance greatly decreasesalthough the number of executed cycles is reduced.

Observation Theoretical gains are not enough to show efficiency.Complete implementations must be evaluated!

Recommendation For our case, using several simple processorsis potentially more efficient.

16 / 17

Page 25: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Finally...

Summary We evaluated folding schemes for BlueJEP andconclude that the performance greatly decreasesalthough the number of executed cycles is reduced.

Observation Theoretical gains are not enough to show efficiency.Complete implementations must be evaluated!

Recommendation For our case, using several simple processorsis potentially more efficient.

16 / 17

Page 26: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Finally...

Summary We evaluated folding schemes for BlueJEP andconclude that the performance greatly decreasesalthough the number of executed cycles is reduced.

Observation Theoretical gains are not enough to show efficiency.Complete implementations must be evaluated!

Recommendation For our case, using several simple processorsis potentially more efficient.

16 / 17

Page 27: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Thank you!

Questions?

17 / 17

Page 28: Investigating Hardware Micro-Instruction Folding in a Java ...d3s.mff.cuni.cz/jtres2010/slides/p102-gruian-slides.pdf · Introduction Folding BlueJEP Implementation and Experiments

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Thank you!

Questions?

17 / 17