a scalable high-precision and high-throughput architecture ... · 30th ieee international...

31
Improving Emulation of Quantum Algorithms using Space-Efficient Hardware Architectures Naveed Mahmud, and Esam El-Araby University of Kansas (KU) 30th IEEE International Conference on Application-specific Systems, Architectures and Processors July 15-17, 2019 Cornell Tech, New York ASAP 2019

Upload: others

Post on 03-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

Improving Emulation of Quantum Algorithms using Space-Efficient Hardware Architectures

Naveed Mahmud, and Esam El-Araby

University of Kansas (KU)

30th IEEE International Conference onApplication-specific Systems, Architectures and Processors

July 15-17, 2019Cornell Tech, New York

ASAP 2019

Page 2: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

2 ASAP 2019 – July 16th, 2019

Outline

Introduction and MotivationRelated Work and BackgroundProposed WorkExperimental ResultsConclusions and Future Work

Page 3: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

3 ASAP 2019 – July 16th, 2019

Introduction and Motivation

Why Quantum Computing? Efficient quantum algorithms Solving NP-hard problems Speedup over classical

Need for Quantum Emulation Difficulty of maintenance & control High-cost of access

E.g., academic hourly rate of $1,250 up to 499 annual hours

Verification and benchmarking Analysis of quantum algorithms Improving classical computing

paradigms

Emulation using FPGAs Greater speedup vs. SW Dynamic (reconfigurable) vs. fixed

architectures Exploiting parallelism Limitation → Scalability

source: https://learning.acm.org/techtalks/qiskit

source: https://learning.acm.org/techtalks/qiskit

Page 4: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

4 ASAP 2019 – July 16th, 2019

Introduction and Motivation

Why Quantum Computing? Efficient quantum algorithms Solving NP-hard problems Speedup over classical

Need for Quantum Emulation Difficulty of maintenance & control High-cost of access

E.g., academic hourly rate of $1,250 up to 499 annual hours

Verification and benchmarking Analysis of quantum algorithms Improving classical computing

paradigms

Emulation using FPGAs Greater speedup vs. SW Dynamic (reconfigurable) vs. fixed

architectures Exploiting parallelism Limitation → Scalability

Google’s 72-qubit “Bristlecone” Intel’s 49-qubit “Tangle Lake” IBM-Q 50-qubit computer

D-Wave 2000QIonQ’s 79-qubit computerRigetti’s 16-qubit ASPEN-4

source: https://learning.acm.org/techtalks/qiskit

Page 5: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

5 ASAP 2019 – July 16th, 2019

Outline

Introduction and MotivationRelated Work and BackgroundProposed WorkExperimental ResultsConclusions and Future Work

Page 6: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

7 ASAP 2019 – July 16th, 2019

Related Work (Parallel SW Simulators)

Villalonga, et al., “Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation,” May 2019 Simulation of 7x7 and 11x11 random quantum circuits (RQCs) of depth 42 and 26 respectively. Summit supercomputer (ORNL, USA) with 4550 nodes 1.6 TB of non-volatile memory per node Power consumption of 7.3 MW

Li et al., “Quantum Supremacy Circuit Simulation on Sunway TaihuLight,” Aug. 2018 Simulation of 49-qubit random quantum circuits of depth of 55 Sunway supercomputer (NSC, China) with 131,072 nodes (32,768 CPUs) 1 PB total main memory

J. Chen, et al., “Classical Simulation of Intermediate-Size Quantum Circuits,” May 2018 Simulation of up to 144-qubit random quantum circuits of depth 27 Supercomputing cluster (Alibaba Group, China) with 131,072 nodes 8 GB memory per node

De Raedt et al., “Massively parallel quantum computer simulator eleven years later,” May 2018 Simulation of Shor’s algorithm using 48-qubits Various supercomputing platforms: IBM Blue Gene/Q (decommissioned), JURECA (Germany), K computer (Japan), Sunway TaihuLight (China) Up to 16-128 GB memory/node utilized

T. Jones, et al., “QuEST and High Performance Simulation of Quantum Computers,” May 2018 Simulation of random quantum circuits up to 38 qubits ARCUS supercomputer (ARCHER, UK) with 2048 nodes Up to 256 GB memory per node

List of quantum SW simulators

https://quantiki.org/wiki/list-qc-simulators

Page 7: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

8 ASAP 2019 – July 16th, 2019

Related Work (FPGA Emulators)

J. Pilch, and J. Dlugopolski, “An FPGA-based real quantum computer emulator” December 2018 Results for up to 2-qubit Deutsch’s algorithm Details of precision used not presented Limited scalability

A. Silva, and O.G. Zabaleta, “FPGA quantum computing emulator using high level design tools,” August 2017 Results for up to 6-qubit QFT Details of precision used not presented No approach to improve scalability

Y.H. Lee, M. Khalil-Hani, and M.N. Marsono, “An FPGA-based quantum computing emulation framework based on serial-parallel architecture,” March 2016 Results of 5-qubit QFT and 7-qubit Grover’s reported Up to 24-bit fixed-point precision No optimizations to make designs scalable

A.U. Khalid, Z. Zilic, and K. Radecka, “FPGA emulation of quantum circuits,” October 2004 3-qubit QFT and Grover’s search implemented Fixed-point precision (16 bits) Low operating frequency

M. Fujishima, “FPGA-based high-speed emulator of quantum computing,” December 2003 Logic quantum processor that abstracts quantum circuit operations into binary logic Coefficients of qubit states modeled as binary, not complex No resource utilization reported

Page 8: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

9 ASAP 2019 – July 16th, 2019

Background (Quantum Computing)

Qubits Physical implementations

Electron (spin) Nucleus (spin through NMR) Photon (polarization encoding) Josephson junction (superconducting qubits)

Theoretical representation Bloch sphere

» Basis states | ⟩0 , | ⟩1» Pure states | ⟩𝜓𝜓

Vector of complex coefficients

Superposition Linear sum of distinct basis states Converts to classical logic when measured Applies to state with n-qubits

Entanglement Strong correlation between qubits Entangled state cannot be factored Tensor (Kronecker) product representation

N = 2n basis states, where, n is number of qubits

| ⟩𝜓𝜓 = 𝛼𝛼| ⟩0 + 𝛽𝛽| ⟩1 ≡𝛼𝛼𝛽𝛽 , and

𝑝𝑝 𝜓𝜓 → | ⟩0 = 𝛼𝛼 2 ; 𝑝𝑝 𝜓𝜓 → | ⟩1 = 𝛽𝛽 2

| ⟩𝑞𝑞𝑞𝑞𝑞𝑞𝑞𝑞𝑞 = | ⟩𝑞𝑞𝑞 ⊗ | ⟩𝑞𝑞𝑞 ⊗ | ⟩𝑞𝑞𝑞| ⟩𝜓𝜓 = 𝛼𝛼1𝛼𝛼2𝛼𝛼3| ⟩000 + 𝛼𝛼1𝛼𝛼2𝛽𝛽3|0 ⟩01 + 𝛼𝛼1𝛽𝛽2𝛼𝛼3| ⟩010 +

… + 𝛽𝛽1 𝛽𝛽2𝛽𝛽3|1 ⟩11NMR ≡ Nuclear Magnetic Resonance

Page 9: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

11 ASAP 2019 – July 16th, 2019

Background (Quantum Fourier Transform)

Quantum Fourier Transform (QFT) Fundamental quantum algorithm Quantum equivalent of classical Discrete

Fourier Transform (DFT) Quadratic speedup over DFT Input

Coefficients of a quantum superimposed state

Output Coefficients of transformed superimposed

state

𝑈𝑈𝑄𝑄𝑄𝑄𝑄𝑄 ��𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜 =1𝑁𝑁�𝑘𝑘=0

𝑁𝑁−1

�𝑞𝑞=0

𝑁𝑁−1

𝑓𝑓 𝑞𝑞𝑞𝑞𝑞𝑞 𝑒𝑒 �2𝜋𝜋𝜋𝜋(𝑞𝑞𝑘𝑘𝑁𝑁 ⟩|𝒌𝒌 = �𝑘𝑘=0

𝑁𝑁−1

𝑪𝑪𝒌𝒌𝒐𝒐𝒐𝒐𝒐𝒐 ⟩|𝒌𝒌

𝑈𝑈𝑄𝑄𝑄𝑄𝑄𝑄 =1𝑁𝑁

1 1 1 … 11 𝜔𝜔𝑛𝑛 𝜔𝜔𝑛𝑛2 … 𝜔𝜔𝑛𝑛𝑁𝑁−1

1 𝜔𝜔𝑛𝑛2 𝜔𝜔𝑛𝑛4 … 𝜔𝜔𝑛𝑛2 𝑁𝑁−1

1 𝜔𝜔𝑛𝑛3 𝜔𝜔𝑛𝑛6 … 𝜔𝜔𝑛𝑛3 𝑁𝑁−1

⋮ ⋮ ⋮ ⋱ ⋮1 𝜔𝜔𝑛𝑛𝑁𝑁−1 𝜔𝜔𝑛𝑛

2(𝑁𝑁−1) … 𝜔𝜔𝑛𝑛(𝑁𝑁−1)(𝑁𝑁−1)

QFT matrix

𝜔𝜔 = 𝑒𝑒2𝜋𝜋𝜋𝜋𝑁𝑁 , 𝑁𝑁 = 𝑞𝑛𝑛, and 𝑛𝑛 = number of qubits

��𝜓𝜓𝜋𝜋𝑛𝑛 =1𝑁𝑁�𝑞𝑞=0

𝑁𝑁−1

𝑓𝑓 𝑞𝑞𝑞𝑞𝑞𝑞 ⟩|𝒒𝒒 = �𝑞𝑞=0

𝑁𝑁−1

𝑪𝑪𝒒𝒒𝒊𝒊𝒊𝒊 ⟩|𝒒𝒒

QFT

outψinψ UQFT

Page 10: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

12 ASAP 2019 – July 16th, 2019

Outline

Introduction and MotivationBackground and Related WorkProposed Work

Emulation Approaches Hardware Architectures

Experimental WorkConclusions and Future Work

Page 11: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

13 ASAP 2019 – July 16th, 2019

Emulation Approaches – Direct circuit/gate approach

QFT Emulation

𝑇𝑇2𝑇𝑇𝑞 𝑇𝑇3 𝑇𝑇4 𝑇𝑇5

5-qubit QFT algorithm/circuit

Hardware node models for 5-qubit QFT

Modeling the quantum circuit as a series of transformations

QFT ≡ Quantum Fourier Transform 𝑇𝑇1 = 𝐻𝐻⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼𝑇𝑇𝑞 = 𝐶𝐶𝐶𝐶2 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼𝑇𝑇𝑞 = 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐶𝐶𝐶𝐶3 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼𝑇𝑇4 = 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 . 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .

𝐶𝐶𝐶𝐶4 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼𝑇𝑇5 = 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆 . 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 .

𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐶𝐶𝐶𝐶5 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆

⋯Node1 Node2 Node3

Advantages Modular Generalized framework

Disadvantages High resource utilization Longer latency Poor scalability

Page 12: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

14 ASAP 2019 – July 16th, 2019

Emulation Approaches – CMAC approach

Methodology Vector-matrix multiplication

Complex multiply-and-accumulate (CMAC) Input quantum state vector, | ⟩𝝍𝝍𝒊𝒊𝒊𝒊

Algorithm reduced to a unitary matrix, 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨 lookup / dynamic generation / stream

Output quantum state vector, | ⟩𝝍𝝍𝒐𝒐𝒐𝒐𝒐𝒐

Advantages Generalized approach for any quantum algorithm Independent of circuit depth Lower resource utilization Lower latency Higher scalability Parallelizable hardware architectures

Disadvantages Calculating 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨 could be challenging, but doable

��𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜 = 𝑈𝑈𝐴𝐴𝐴𝐴𝐴𝐴 . ��𝜓𝜓𝜋𝜋𝑛𝑛

Page 13: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

15 ASAP 2019 – July 16th, 2019

Emulation Approaches – CMAC approach optimizations

Lookup Computation elements (𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨) stored in memory Scalability limited by available memory Speed limited by memory bandwidth

Dynamic generation Additional hardware units required for

generating 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨

Improved memory utilization Improved scalability Speed limited by complexity of algorithm

generating 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨

Stream No hardware required for 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨

Improved memory utilization Improved scalability Speed limited by I/O bandwidth

��𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜 = 𝑈𝑈𝐴𝐴𝐴𝐴𝐴𝐴 . ��𝜓𝜓𝜋𝜋𝑛𝑛

Page 14: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

16 ASAP 2019 – July 16th, 2019

Outline

Introduction and MotivationBackground and Related WorkProposed Work

Emulation Approaches Hardware Architectures

Experimental WorkConclusions and Future Work

Page 15: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

17 ASAP 2019 – July 16th, 2019

Hardware Architectures – CMAC approach

Complex Multiply-and-Accumulate (CMAC) unit Implements vector-matrix multiplication on hardware Complex valued inputs Single/double-precision floating-point

𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 𝑖𝑖 = �𝑗𝑗=0

𝑁𝑁−1

𝑹𝑹𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓(𝒊𝒊, 𝒋𝒋)

𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜𝜋𝜋𝑖𝑖𝑟𝑟𝑖𝑖 𝑖𝑖 = �

𝑗𝑗=0

𝑁𝑁−1

𝑹𝑹𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊(𝒊𝒊, 𝒋𝒋)

where, 𝑖𝑖 = 0, 1, 𝑞, … , 𝑁𝑁 − 1

𝑹𝑹𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓 𝒊𝒊, 𝒋𝒋 = 𝜓𝜓𝜋𝜋𝑛𝑛𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 𝑗𝑗 × 𝑼𝑼𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓 𝒊𝒊, 𝒋𝒋 − 𝜓𝜓𝜋𝜋𝑛𝑛𝜋𝜋𝑖𝑖𝑟𝑟𝑖𝑖 𝑗𝑗 × 𝑼𝑼𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊 𝒊𝒊, 𝒋𝒋

𝑹𝑹𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊 𝒊𝒊, 𝒋𝒋 = 𝜓𝜓𝜋𝜋𝑛𝑛𝜋𝜋𝑖𝑖𝑟𝑟𝑖𝑖 𝑗𝑗 × 𝑼𝑼𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓 𝒊𝒊, 𝒋𝒋 + 𝜓𝜓𝜋𝜋𝑛𝑛𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 𝑗𝑗 × 𝑼𝑼𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊 𝒊𝒊, 𝒋𝒋

Page 16: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

18 ASAP 2019 – July 16th, 2019

Hardware Architectures – CMAC approach

Single CMAC Fully optimized for area One CMAC instance 𝑵𝑵 cycles to store input state vector elements 𝑵𝑵𝟐𝟐 cycles to compute for all algorithm matrix elements

N-concurrent CMAC Fully optimized for speed 𝑵𝑵 parallel CMAC instances 𝑵𝑵 cycles to store input state vector elements 𝑵𝑵 cycles to compute for all algorithm matrix elements

Dual-sequential CMAC Two CMAC instances connected serially Storage overlapped with computations Improved execution time

Single CMAC

N-concurrent CMAC

Dual sequential CMACN = 2n basis states, and n = number of qubits

Page 17: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

19 ASAP 2019 – July 16th, 2019

Hardware Architectures – CMAC approach

CMAC Architecture

Complexity

Space (𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓) Time (𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓)

Single 𝑂𝑂(1) 𝑂𝑂(𝑁𝑁2)

N-concurrent 𝑂𝑂(𝑁𝑁) 𝑂𝑂(𝑁𝑁)

Dual sequential 𝑂𝑂(1) 𝑂𝑂(𝑁𝑁2)

Space and Time complexities of proposed architectures

Complexity analysis of CMAC Architectures Single CMAC𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓 = 𝑨𝑨𝟏𝟏 + 𝑵𝑵 + 𝑵𝑵𝟐𝟐 × 𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 = 𝑶𝑶(𝑵𝑵𝟐𝟐)𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓 = 𝟏𝟏 × 𝑪𝑪𝑪𝑪𝑨𝑨𝑪𝑪 = 𝑶𝑶(𝟏𝟏)

N-concurrent CMACs𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓 = 𝑨𝑨𝟐𝟐 + 𝟐𝟐𝑵𝑵 × 𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 = 𝑶𝑶(𝑵𝑵)𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓 = 𝑵𝑵 × 𝑪𝑪𝑪𝑪𝑨𝑨𝑪𝑪𝒔𝒔 = 𝑶𝑶(𝑵𝑵)

Dual-sequential CMACs𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓 = 𝑨𝑨𝟑𝟑 + 𝑵𝑵𝟐𝟐 × 𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 = 𝑶𝑶(𝑵𝑵𝟐𝟐)𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓 = 𝟐𝟐 × 𝑪𝑪𝑪𝑪𝑨𝑨𝑪𝑪𝒔𝒔 = 𝑶𝑶(𝟏𝟏)

𝑨𝑨𝟏𝟏,𝑨𝑨𝟐𝟐,𝑨𝑨𝟑𝟑 ≡ initial pipeline latencies𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 ≡ system clock period

Page 18: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

20 ASAP 2019 – July 16th, 2019

Outline

Introduction and MotivationBackground and Related WorkProposed WorkExperimental WorkConclusions and Future Work

Page 19: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

21 ASAP 2019 – July 16th, 2019

Experimental Setup

Testbed Platform High-performance reconfigurable

computing (HPRC) system from DirectStream

Single compute node to warehouse scale multi-node deployments

OS-less, FPGA-only (Arria 10) architecture On-chip memory (OCM) On-board memory (OBM) Removes interconnection bottlenecks Reduces resource contention and energy use

Highly productive development environment Parallel High-Level Language C++-to-HW (previously Carte-C) compiler Smooth learning curve + fast development time

DirectStream (DS8) system

Page 20: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

23 ASAP 2019 – July 16th, 2019

Experimental Results

QFT emulation using on-chip memory (OCM) architectures

Number of qubits

On-chip resource* utilization (%) Emulation time (sec)**ALMs BRAMs DSPs

2 10.30 8.04 1.05 1.4E-6

3 10.24 8.12 1.05 1.15E-6

4 10.24 8.11 1.05 2.01E-6

5 10.27 8.18 1.05 5.37E-6

6 10.26 8.55 1.05 1.87E-5

7 10.26 10.25 1.05 7.17E-5

8 10.29 16.73 1.05 3.19E-4

9 10.31 41.28 1.05 0.0013

Single CMAC (lookup)

*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Operating frequency: 233 MHz

ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block

Page 21: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

24 ASAP 2019 – July 16th, 2019

Experimental Results

Number of qubits

On-chip resource* utilization (%) Emulation time (sec)**ALMs BRAMs DSPs

2 10.70 8.04 1.05 6.78E-7

3 10.74 8.12 1.05 7.64E-7

4 11.53 8.11 1.05 9.36E-7

5 17.10 8.18 1.05 1.28E-6

6 24.5 8.55 1.05 1.97E-6

7 39.5 10.25 1.05 3.34E-6

8 74.88 16.73 1.05 6.09E-6

N-concurrent CMAC (lookup)

*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Operating frequency: 233 MHz

QFT emulation using on-chip memory (OCM) architectures

ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block

Page 22: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

25 ASAP 2019 – July 16th, 2019

Experimental Results

Number of qubits

On-chip resource* utilization (%) Emulation time (sec)**ALMs BRAMs DSPs

2 12.39 8.55 2.11 7.55E-7

3 12.34 8.55 2.11 9.61E-7

4 12.36 8.63 2.11 1.79E-6

5 12.43 8.70 2.11 5.08E-6

6 12.38 8.99 2.11 1.83E-5

7 12.39 10.69 2.11 7.1E-5

8 12.37 17.18 2.11 0.0003

9 12.37 43.54 2.11 0.0011

Dual sequential CMAC (lookup)

*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Operating frequency: 233 MHz

QFT emulation using on-chip memory (OCM) architectures

ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block

Page 23: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

26 ASAP 2019 – July 16th, 2019

Experimental Results

Comparison of execution times for on-chip memory (OCM) architectures

Page 24: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

27 ASAP 2019 – July 16th, 2019

Experimental Results

QFT emulation using on-board memory (OBM) architectures

Qubits

On-chip resource* utilization (%)

OBM** Utilization (bytes) Emulation time

(sec)***ALMs BRAM DSPs SRAM SDRAM

2 10.71 8.44 1.05 32 128 1.7E-63 10.71 8.44 1.05 64 512 2.0E-64 10.71 8.44 1.05 128 2K 3.9E-65 10.71 8.44 1.05 256 8K 1.1E-56 10.71 8.44 1.05 512 32K 3.9E-57 10.71 8.44 1.05 1K 128K 0.000158 10.71 8.44 1.05 2K 512K 0.000619 10.71 8.44 1.05 4K 2M 0.00241

10 10.71 8.44 1.05 8K 8M 0.0096311 10.71 8.44 1.05 16K 32M 0.0385112 10.71 8.44 1.05 32K 128M 0.1539913 10.71 8.44 1.05 64K 512M 0.6158614 10.71 8.44 1.05 128K 2G 2.3632415 10.71 8.44 1.05 256K 8G 9.85316 10.71 8.44 1.05 512K 32G 39.4209

*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Total on-board memory: 4 parallel SRAM banks of 8MB each and 2 parallel SDRAM banks of 32GB each***Operating frequency: 233 MHz

Qubits

On-chip resource* utilization (%)

OBM** Utilization (bytes) Emulation time

(sec)***ALMs BRAM DSPs SRAM SDRAM

2 12 8.63 2.11 32 128 7.55E-73 12 8.63 2.11 64 512 9.61E-74 12 8.63 2.11 128 2K 1.79E-65 12 8.63 2.11 256 8K 5.08E-66 12 8.63 2.11 512 32K 1.83E-57 12 8.63 2.11 1K 128K 7.10E-58 12 8.63 2.11 2K 512K 0.000289 12 8.63 2.11 4K 2M 0.0011310 12 8.63 2.11 8K 8M 0.0045111 12 8.63 2.11 16K 32M 0.01800212 12 8.63 2.11 32K 128M 0.07200613 12 8.63 2.11 64K 512M 0.288802114 12 8.63 2.11 128K 2G 1.15208315 12 8.63 2.11 256K 8G 4.60832916 12 8.63 2.11 512K 32G 18.4331

Single CMAC (lookup) Dual-sequential CMAC (lookup)

ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block

Page 25: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

28 ASAP 2019 – July 16th, 2019

Experimental Results

Comparison of execution times for on-board memory (OBM) architectures

Page 26: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

30 ASAP 2019 – July 16th, 2019

Experimental Results

QFT emulation using on-board memory (OBM) architectures

*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Total on-board memory: 4 parallel SRAM banks of 8MB each and 2 parallel SDRAM banks of 32GB each***Operating frequency: 233 MHz† Results projected using regression

Qubits

On-chip resource* utilization (%)

OBM** Utilization (bytes) Emulation time

(sec)***ALMs BRAMs DSPs SDRAM

2 11 8 1 32 2.3E-64 11 8 1 128 3.4E-66 11 8 1 512 2.0E-58 11 8 1 2K 2.8E-410 11 8 1 8K 4.5E-312 11 8 1 32K 7.2E-214 11 8 1 128K 1.15E016 11 8 1 512K 1.84E+118 11 8 1 2M 2.95E+220 11 8 1 8M 4.72E+322 11 8 1 32M 7.5E+4†24 11 8 1 128M 1.2E+6†26 11 8 1 512M 1.93E+7†28 11 8 1 2G 3.09E+8†30 11 8 1 8G 4.95E+9†32 11 8 1 32G 7.92E+10†

Dual-sequential CMAC (stream)Dual-sequential CMAC (dynamic generation)

ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block

Qubits

On-chip resource* utilization (%)

OBM** Utilization (bytes) Emulation time

(sec)***ALMs BRAMs DSPs SDRAM

2 13.16 9.58 3.23 32 1.99E-64 13.16 9.58 3.23 128 3.02E-66 13.16 9.58 3.23 512 1.95E-58 13.16 9.58 3.23 2K 0.000310 13.16 9.58 3.23 8K 0.004512 13.16 9.58 3.23 32K 0.072014 13.16 9.58 3.23 128K 1.152116 13.16 9.58 3.23 512K 18.43318 13.16 9.58 3.23 2M 294.9320 13.16 9.58 3.23 8M 4718.93422 13.16 9.58 3.23 32M 18876†24 13.16 9.58 3.23 128M 302012†26 13.16 9.58 3.23 512M 4832188†28 13.16 9.58 3.23 2G 7.73E+7 †30 13.16 9.58 3.23 8G 1.23E+9 †32 13.16 9.58 3.23 32G 1.979E+10 †

Page 27: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

31 ASAP 2019 – July 16th, 2019

Experimental Results

Number of pixels Qubits

On-chip resource* utilization (%)

OBM** Utilization (bytes) Emulation time

(sec)***ALMs BRAMs DSPs SDRAM

16x16 8 14 9 2 4K 6.73E-632x32 10 14 9 2 16K 1.66E-564x64 12 14 9 2 64K 5.62E-5

128x128 14 14 9 2 256K 0.0002256x256 16 14 9 2 1M 0.0008512x512 18 14 9 2 4M 0.0034

1024x1024 20 14 9 2 16M 0.01352048x2048 22 14 9 2 64M 0.0540

4Kx4K 24 14 9 2 256M 0.21608Kx8K 26 14 9 2 1G 0.8641

16Kx16K 28 14 9 2 4G 3.456332Kx32K 30 14 9 2 16G 13.825

QubitsOn-chip resource*

utilization (%)OBM** Utilization

(bytes) Emulation time (sec)***

ALMs BRAMs DSPs SDRAM2 11 8 1 32 2.3E-64 11 8 1 128 3.4E-66 11 8 1 512 2.0E-58 11 8 1 2K 2.8E-410 11 8 1 8K 4.5E-312 11 8 1 32K 7.2E-214 11 8 1 128K 1.15E016 11 8 1 512K 1.84E+118 11 8 1 2M 2.95E+220 11 8 1 8M 4.72E+322 11 8 1 32M 7.5E+4†24 11 8 1 128M 1.2E+6†26 11 8 1 512M 1.93E+7†28 11 8 1 2G 3.09E+8†30 11 8 1 8G 4.95E+9†32 11 8 1 32G 7.92E+10†

Grover’s search using OBM architecturesSingle CMAC (stream)

Quantum Haar transform (QHT) using OBM architecturesKernel approach

*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Total on-board memory: 4 parallel SRAM banks of 8MB each and 2 parallel SDRAM banks of 32GB each***Operating frequency: 233 MHz† Results projected using regression

ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block

Page 28: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

32 ASAP 2019 – July 16th, 2019

Experimental Results

Comparison with related work (FPGA emulation)Reported Work Algorithm Number of qubits Precision Operating

frequency (MHz)Emulation time (sec)

Fujishima (2003) Shor’s factoring - - 80 10

Khalid et al (2004)QFT 3 16-bit fixed pt.

82.161E-9

Grover’s search 3 16-bit fixed pt. 84E-9

Aminian et al (2008) QFT 3 16-bit fixed pt. 131.3 46E-9

Lee et al (2016)QFT 5 24-bit fixed pt. 90 219E-9

Grover’s search 7 24-bit fixed pt. 85 96.8E-9

Silva and Zabaleta(2017) QFT 4 32-bit floating pt. - 4E-6

Pilch and Dlugopolski(2018) Deutsch 2 - - -

Proposed work

QFT 20

32-bit floating pt. 233

18.4

QHT 30 13.8

Grover’s search 20 18.4

Page 29: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

33 ASAP 2019 – July 16th, 2019

Conclusions

Supremacy of Quantum Computing Need for Quantum Emulation Emulation using FPGAs

Proposed Approaches and Methods Direct circuit/gate approach CMAC approach + lookup / dynamic generation / stream Kernel approach

Case studies Quantum Fourier Transform (QFT) Multi-dimensional Quantum Haar Transform (QHT) Single-pattern / multi-pattern Grover’s search algorithm

Testbed Platform State-of-the-art HPRC system from DirectStream C++ to hardware compiler

Page 30: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

34 ASAP 2019 – July 16th, 2019

Future Work

More algorithms Integer factoring using Shor’s algorithm Dimension reduction using QHT Image pattern recognition using QHT and Grover’s search

Design Optimizations Combining / sharing resources, e.g., multipliers

Accuracy trade-off study Fixed-point vs. floating-point implementations for higher scalability

Quantum error correction (QEC)

Power efficiency

Page 31: A Scalable High-Precision and High-Throughput Architecture ... · 30th IEEE International Conference on. Application-specific Systems, Architectures and Processors. July 15-17, 2019

ASAP 2019 – July 16th, 2019