
Program-at-a-Glance

Monday, Mar. 18th, 2019

Time \ Venue: Ballroom C, 10F | Ballroom D, 11F

08:30~ Registration

09:00-10:30 (90’)
Ballroom C: Tutorial 1-1, Connecting ONNX to Proprietary DLAs: An Introduction to Open Neural Network Compiler. Luba Tang, Skymizer, Taiwan
Ballroom D: Tutorial 2-1, Neuromorphic Artificial Intelligence. Tobi Delbruck, University of Zurich and ETH Zurich, Switzerland

10:30-10:45 (15’) Coffee Break

10:45-12:15 (90’)
Ballroom C: Tutorial 1-2, Connecting ONNX to Proprietary DLAs: An Introduction to Open Neural Network Compiler. Luba Tang, Skymizer, Taiwan
Ballroom D: Tutorial 2-2, Neuromorphic Artificial Intelligence. Tobi Delbruck, University of Zurich and ETH Zurich, Switzerland

12:15-13:15 (60’) Tutorial Lunch @Ballroom A, 10F

13:15-14:45 (90’)
Ballroom C: Tutorial 3, Memory-Centric Chip Architecture for Deep Learning. Sungjoo Yoo, Seoul National University, Korea
Ballroom D: Tutorial 4-1, BRAINWAY and Nano-Abacus Architecture: Brain-Inspired Cognitive Computing Using Energy Efficient Physical Computational Structures, Algorithms and Architecture Co-Design. Andreas Andreou, Johns Hopkins University, USA

14:45-15:00 (15’) Coffee Break

15:00-16:30 (90’)
Ballroom C: Tutorial 5, SRAM and RRAM Based In-Memory Computing for Deep Learning: Opportunities and Challenges. Jae-sun Seo, Arizona State University, USA
Ballroom D: Tutorial 4-2, BRAINWAY and Nano-Abacus Architecture: Brain-Inspired Cognitive Computing Using Energy Efficient Physical Computational Structures, Algorithms and Architecture Co-Design. Andreas Andreou, Johns Hopkins University, USA

16:30-16:40 (10’) Break

16:40-17:00 (20’) Opening Ceremony @Ballroom B, 10F

17:00-18:00 (60’) Keynote #1 @Ballroom B, 10F: Re-Engineering Computing with Neuro-Inspired Learning: Devices, Circuits, and Systems. Kaushik Roy, Purdue University, USA

18:00-18:45 (45’) Keynote #2 @Ballroom B, 10F: AI Transforming Hardware Design. Chekib Akrout, Synopsys, USA

18:45-21:00 (135’) Welcome Reception & Jeopardy @Ballroom A, 10F


Tuesday, Mar. 19th, 2019

Time \ Venue: Ballroom B, 10F | Ballroom C, 10F | Ballroom D, 11F

08:00 Registration

08:30-09:30 (60’) Keynote #3 @Ballroom B, 10F: How Edge AI Technology Is Redefining Smart Devices. Ryan Chen, MediaTek Inc., Taiwan

09:30-10:50 (80’)
Ballroom B: Special Session 1, Smart Circuit Techniques for Neural Networks
Ballroom C: Lecture Session 1, Deep Neural Network for Computer Vision
Ballroom D: Lecture Session 2, Hardware Accelerators for AI

10:50-11:10 (20’) Coffee Break

11:10-12:30 (80’)
Ballroom B: Special Session 2, Edge and Fog Computing to Enable AI in IoT
Ballroom C: Lecture Session 3, Neuromorphic Processors
Ballroom D: Lecture Session 4, Application Specific AI Accelerators

12:30-13:30 (60’) Lunch @Ballroom A, 10F

13:30-15:00 (90’) Panel Discussion @Ballroom B, 10F: AI Computing for Smart Life: What, Why, Who, and Where

15:00-16:00 (60’) WICAS & YP @Ballroom B, 10F: Influences of EDGE Device’s Instant Decision: From Bio-Tech, FinTech to Sustainable Energy & Beyond

16:00-16:20 (20’) Coffee Break

16:20-18:00 (100’)
Ballroom B: Special Session 3, Analytics Algorithm/Architecture for Smart System Design
Ballroom C: Lecture Session 5, Deep Learning for Speech and Low-dimensional Signal Processing

18:30-20:30 Banquet @Ballroom A, 10F


Wednesday, Mar. 20th, 2019

Time \ Venue: Ballroom B, 10F | Ballroom C, 10F | Ballroom D, 11F

08:00 Registration

08:30-09:30 (60’) Keynote #4: Edge Intelligence for Optimized Systems & High-Performance Devices. Anthony Vetro, MERL, USA

09:30-10:30 (60’)
Ballroom B: Special Session/Forum, 2018 Low-Power Image Recognition Challenge and Beyond
Ballroom C: Lecture Session 6, Medical AI (I)
Ballroom D: Industrial Session 1, AI Computing Platform

10:30-10:50 (20’) Coffee Break

10:50-12:10 (80’)
Ballroom B: Special Session 4, Intelligent Processing of Time-series Signals
Ballroom C: Lecture Session 7, Medical AI (II)
Ballroom D: Industrial Session 2, Compiler Technology for AI Chip

12:10-13:10 (60’) Lunch @8F

13:10-14:40 (90’) Poster Session / Live Demo / Showcase @Ballroom A, 10F

14:40-16:00 (80’)
Ballroom B: Special Session 5, Emerging Memory Technologies for Neuromorphic Circuits and Systems
Ballroom C: Lecture Session 8, Low Precision Neural Network

16:00-16:20 (20’) Coffee Break

16:20-17:40 (80’)
Ballroom B: Special Session 6, AI in Advanced Applications
Ballroom C: Lecture Session 9, Hardware Oriented Neural Network Optimization


KN1 Keynote 1

Monday, March 18|17:00-18:00

Ballroom B, 10F

Chair(s):

Chen-Hao Chang, National Chung Hsing University, Taiwan

Re-Engineering Computing with Neuro-Inspired Learning: Devices, Circuits, and Systems

Kaushik Roy

Purdue University, USA

KN2 Keynote 2

Monday, March 18|18:00-18:45

Ballroom B, 10F

Chair(s):

Mohamad Sawan, Westlake University, China

AI Transforming Hardware Design

Chekib Akrout

Synopsys, USA

KN3 Keynote 3

Tuesday, March 19|08:30-09:30

Ballroom B, 10F

Chair(s): David Brooks, Harvard University, USA

How Edge AI Technology is Redefining Smart Devices

Ryan Chen

MediaTek Inc., Taiwan

SS01 Special Session 1 Smart Circuit Techniques for Neural Networks

Tuesday, March 19|09:30-10:50

Ballroom B, 10F

Chair(s):

Ren-Shuo Liu, National Tsing Hua University, Taiwan

Mohamad Sawan, Westlake University, China

SS01.1 09:30-09:50 Auto Generation of High-Performance Fixed-Point Multiplier for Artificial Neural Networks

Yang Zhao*, Zhongxia Shang, Yong Lian

York University, Canada

The multiplier is a critical building block in artificial neural networks (ANNs). Its precision and connection structure should be optimized for a given ANN to achieve the best energy, speed, and area efficiency, so changes in the ANN application or the CMOS process often force a redesign of the multiplier. This paper presents an automatic generation method for high-performance fixed-point multipliers based on three techniques: a Modified Booth Encoding (MBE) scheme, an improved three-dimensional reduction method (ITDM), and mixed parallel pipelining (MPP). The MBE is customized for ANNs with ReLU activation functions, removing the sign bit of the multiplicand to save area. The ITDM further shortens the critical path by repositioning the half adders of the conventional TDM. The proposed MPP divides the structures into different stages for parallel and pipelined implementation. The auto-generated multiplier is 4.04 times faster, and its layout 29% denser and more regular, than a conventional multiplier combining MBE with TDM.
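As background for the MBE technique named above, the following minimal Python sketch shows standard radix-4 Booth recoding, which halves the number of partial products; the paper's ReLU-specific sign-bit optimization and reduction tree are not reproduced here.

```python
def booth_radix4(x, y, n_bits=8):
    """Multiply two n-bit two's-complement integers using radix-4
    (modified) Booth recoding: y is recoded into digits in {-2,-1,0,1,2},
    so only n/2 partial products are needed instead of n."""
    mask = (1 << n_bits) - 1
    bits = [((y & mask) >> i) & 1 for i in range(n_bits)]
    bits = [0] + bits                      # guard bit b[-1] = 0 at the LSB
    acc = 0
    for i in range(n_bits // 2):
        # digit d_i = -2*b[2i+1] + b[2i] + b[2i-1]
        d = -2 * bits[2 * i + 2] + bits[2 * i + 1] + bits[2 * i]
        acc += d * x * (4 ** i)            # shifted partial product
    return acc

# Sanity check against ordinary multiplication:
for a in range(-128, 128, 17):
    for b in range(-128, 128, 19):
        assert booth_radix4(a, b) == a * b
```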

SS01.2 09:50-10:10 Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference

Linyan Mei*1, Mohit Dandekar1, Dimitrios Rodopoulos2, Jeremy Constantin2, Peter Debacker2, Rudy Lauwereins2, Marian Verhelst1

1KU Leuven, Belgium; 2imec, Belgium

To enable energy-efficient embedded execution of Deep Neural Networks (DNNs), the critical sections of these workloads, their multiply-accumulate (MAC) operations, need to be carefully optimized. The state of the art (SotA) pursues this through run-time precision-scalable MAC operators, which can support the varying precision needs of DNNs in an energy-efficient way. Yet, to implement the adaptable-precision MAC operation, most SotA solutions rely on separately optimized low-precision multipliers and a precision-variable accumulation scheme, with the possible disadvantages of high control complexity and degraded throughput. This paper first optimizes one of the most effective SotA techniques to support fully-connected DNN layers. This mode, exploiting the transformation of a high-precision multiplier into independent parallel low-precision multipliers, will be called the Sum Separate (SS) mode. In addition, this work suggests an alternative low-precision scheme, i.e., the implicit accumulation of multiple low-precision products within the multiplier itself, called the Sum Together (ST) mode. Based on the two types of MAC arrangements explored, corresponding architectures have been proposed to implement DNN processing. The two architectures, yielding the same throughput, are compared at different working precisions (2/4/8/16-bit) based on post-synthesis simulation. The results show that the proposed ST-mode architecture outperforms the earlier SS mode by up to 1.6x in energy efficiency (TOPS/W) and 1.5x in area efficiency (GOPS/mm2).
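To make the sum-together idea concrete, here is a toy bit-packing illustration, assuming unsigned 4-bit operands, of how one wide multiplication can accumulate two low-precision products implicitly; the paper's precision-scalable MAC circuits are, of course, different.

```python
# Two low-precision products are accumulated implicitly inside one wide
# multiplication by packing the operands with enough guard bits.
K = 4                   # operand precision (unsigned, < 2**K)
N = 2 * K + 1           # field width: holds a1*b1 + a2*b2 without overflow

def sum_together(a1, a2, b1, b2):
    packed = (a1 + (a2 << N)) * (b2 + (b1 << N))
    # the cross terms land in the middle field: a1*b1 + a2*b2
    return (packed >> N) & ((1 << N) - 1)

for a1 in range(16):
    for a2 in range(16):
        assert sum_together(a1, a2, 7, 13) == a1 * 7 + a2 * 13
```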

SS01.3 10:10-10:30 On-chip Learning of Multilayer Perceptron Based on Memristors with Limited Multilevel States

Yuhang Zhang1, Guanghui He1, Kea-Tiong Tang2, Guoxing Wang*1
1Shanghai Jiao Tong University, China; 2National Tsing Hua University, Taiwan

The cross-point memristor array is viewed as a promising candidate for neuromorphic computing due to its non-volatile storage and parallel computing features. However, the programming threshold and the resistance fluctuation among different multilevel states restrict the capacity of weight representation and thus the numerical precision, which poses great challenges for on-chip learning. This work evaluates the deterioration of learning accuracy of a multilayer perceptron due to limited multilevel states and proposes a stochastic “skip-and-update” algorithm to facilitate on-chip learning with low-precision memristors.
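One plausible reading of such a rule is sketched below, under the assumption that updates smaller than one programmable conductance step are applied probabilistically (the paper's exact criterion may differ):

```python
import numpy as np

# Hedged sketch of a stochastic "skip-and-update" rule: when a requested
# weight change is smaller than one conductance step of the memristor, the
# device is programmed by a full step only with probability |dw| / step,
# and the update is skipped otherwise. Unbiased in expectation.
rng = np.random.default_rng(0)
step = 1.0 / 32                       # one multilevel conductance step

def skip_and_update(w, dw):
    n_steps = dw / step
    full = np.trunc(n_steps)          # whole steps we can always apply
    frac = n_steps - full             # remainder below the threshold
    extra = (rng.random(w.shape) < np.abs(frac)) * np.sign(frac)
    return w + (full + extra) * step

w = skip_and_update(np.zeros(100_000), np.full(100_000, 0.4 * step))
print(w.mean() / step)                # ~0.4 steps on average, as requested
```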

SS01.4 10:30-10:50 Memristor Emulators for an Adaptive DPE Algorithm: Comparative Study

Hussein Assaf*1, Yvon Savaria1, Mohamad Sawan1,2
1Polytechnique Montreal, Canada; 2Westlake University, and Westlake Institute for Advanced Study, China

Vector-matrix multiplication (VMM) is a complex operation requiring large computational power per iteration. Resistive computing, including memristors, is one solution that speeds up VMM by reducing the multiplication process to a few steps regardless of matrix size. In this paper, we propose an Adaptive Dot Product Engine (ADPE) algorithm based on memristors for enhancing resistive computing in VMM. In preliminary results, the algorithm showed 5% error with one online training step on a one-layer memristor crossbar array circuit. However, memristors require new fabrication technologies, and the design and validation of systems using these devices remain challenging. A comparison of various available circuits emulating a memristor suitable for the ADPE is presented, and the models are compared based on chip size, circuit elements used, and operating frequency.

L1 Lecture Session 1 Deep Neural Network for Computer Vision

Tuesday, March 19|09:30-10:50

Ballroom C, 10F

Chair(s):

Chris Gwo Giun Lee, National Cheng Kung University, Taiwan

L1.1 09:30-09:50 Deep Multi-Scale Residual Learning-based Blocking Artifacts Reduction for Compressed Images

Min-Hui Lin1, Chia-Hung Yeh1,2, Chu-Han Lin1, Li-Wei Kang*3, Chih-Hsiang Huang1
1National Sun Yat-sen University, Taiwan; 2National Taiwan Normal University, Taiwan; 3National Yunlin University of Science and Technology, Taiwan

Blocking artifacts, characterized by visually noticeable changes in pixel values along block boundaries, are a general problem in block-based image/video compression systems. Various post-processing techniques have been proposed to reduce blocking artifacts, but most of them introduce excessive blurring or ringing effects. This paper presents a deep learning-based compression artifact reduction (or deblocking) framework relying on multi-scale residual learning. Recent popular approaches usually train deep models using a per-pixel loss function with explicit image priors to directly produce deblocked images. Instead, we formulate the problem as learning the residuals (the artifacts) between original images and the corresponding compressed ones. In our deep model, each input image is first down-scaled, which naturally reduces blocking artifacts. A learned super-resolution (SR) convolutional neural network (CNN) then up-samples the down-scaled version. Finally, the up-scaled version (with fewer artifacts) and the original input are fed into a learned artifact-prediction CNN to estimate the blocking artifacts. As a result, the blocking artifacts can be removed by subtracting the predicted artifacts from the input image while preserving most original visual details.

L1.2 09:50-10:10 Complexity Reduction on HEVC Intra Mode Decision with modified LeNet-5

Hai-Che Ting, Hung-Luen Fang, Jia-Shung Wang*

National Tsing Hua University, Taiwan

The HEVC (H.265) standard was finalized in April 2013 and is currently the prevalent video coding standard. One key contributor to its performance gain over H.264 is intra prediction, which extends the number of prediction directions over various sizes of prediction units (PUs), at the cost of very high computational complexity. Since HEVC emerged, several fast intra-prediction and CU-size decision algorithms have been developed for practical applications. These two components account for around 60% to 70% of encoding time in all-intra HEVC encoding. In this paper, a novel CNN-based solution is proposed and evaluated. The main idea is to select a smallest set of adequate intra directions using our modified LeNet-5 CNN model, thus reducing the computational complexity of (further) rate-distortion optimization to a tolerable limit. Besides, we employ two filters, the edge strength extractor in [4] and the early-terminated CU partition in [12], to skip most of the unlikely directions and to decrease the number of CUs, respectively. The experimental results demonstrate that the proposed method reduces computation by up to 66.59%, with a slight increase in bit-rate (1.1% on average) and a small reduction in picture quality (0.109% on average in PSNR) at most.

L1.3 10:10-10:30 Fast event-driven incremental learning of hand symbols

Iulia Alexandra Lungu*, Shih-Chii Liu, Tobi Delbruck

University of Zurich and ETH Zurich, Switzerland

This paper describes a hand symbol recognition system that can quickly be trained to incrementally learn new symbols using about 100 times less data and time than conventional training. It is driven by frames from a Dynamic Vision Sensor (DVS) event camera. Conventional cameras have very redundant output, especially at high frame rates; dynamic vision sensors instead output sparse, asynchronous brightness-change events that occur when an object or the camera is moving. Images consisting of a fixed number of DVS events drive recognition and incremental learning of new hand symbols in the context of a RoShamBo (rock-paper-scissors) demonstration. Conventional training on the original RoShamBo dataset requires about 12.5 h of compute time on a desktop GPU using the 2.5 million images in the base dataset. Novel symbols that a user shows to the system for a few tens of seconds can be learned on the fly using the iCaRL incremental learning algorithm with 3 minutes of training time on a desktop GPU, while preserving recognition accuracy on previously trained symbols. Our system runs a residual network with 32 layers and maintains an overall accuracy of 88.4% after 100 epochs (77% after 5 epochs) across 4 incremental training stages, each adding 2 novel symbols to the base 4 symbols. The paper also reports an inexpensive robot hand used for live demonstrations of the base RoShamBo game.
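The constant-event-count frames that feed such a network can be illustrated with a short sketch; the event array layout below is an assumption for illustration, not the DVS driver's actual format.

```python
import numpy as np

# Accumulate a fixed number of DVS events into each 2-D histogram frame.
def events_to_frames(x, y, pol, n_per_frame=2000, h=128, w=128):
    frames = []
    for s in range(0, len(x) - n_per_frame + 1, n_per_frame):
        f = np.zeros((h, w), dtype=np.float32)
        sl = slice(s, s + n_per_frame)
        # signed accumulation: ON events add, OFF events subtract
        np.add.at(f, (y[sl], x[sl]), np.where(pol[sl] > 0, 1.0, -1.0))
        frames.append(f)
    return np.stack(frames)

rng = np.random.default_rng(0)
n = 10_000
frames = events_to_frames(rng.integers(0, 128, n), rng.integers(0, 128, n),
                          rng.integers(0, 2, n))
print(frames.shape)                   # (5, 128, 128)
```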

L1.4 10:30-10:50 Slasher: Stadium racer for end-to-end event-based camera autonomous driving experiments

Yuhuang Hu*, Hong Ming Chen, Tobi Delbruck

University of Zurich and ETH Zurich, Switzerland

Slasher is the first open 1/10-scale autonomous driving platform for exploring the use of event-based cameras for fast driving in unstructured indoor and outdoor environments. Slasher features a DAVIS event-based camera and a ROS computer for perception and control. The DAVIS camera provides high dynamic range, sparse output, and sub-millisecond latency for the quick visual control needed for fast driving. A race controller and a Bluetooth remote joystick coordinate the different processing pipelines, and a low-cost ultra-wide-band (UWB) positioning system records trajectories. The modular design of Slasher can easily integrate additional features and sensors. In this paper, we set up a reflexive Convolutional Neural Network (CNN) controller to demonstrate the platform with end-to-end training, and we present preliminary experiments in closed-loop indoor and outdoor trail driving.

L2 Lecture Session 2 Hardware Accelerators for AI

Tuesday, March 19|09:30-10:50

Ballroom D, 11F

Chair(s):

Lan-Da Van, National Chiao Tung University, Taiwan

L2.1 09:30-09:50 A CMOS-based Resistive Crossbar Array with Pulsed Neural Network for Deep Learning Accelerator

Injune Yeo*, Sang-gyun Gi, Jung-gyun Kim, Byung-geun Lee

School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Korea

A CMOS-based resistive computing element (RCE), which can be integrated into a crossbar array, is presented. The RCE overcomes the hardware constraints of existing memristive devices, such as the limited dynamic range of conductance, I-V nonlinearity, and on/off ratio, without increasing hardware complexity compared to other CMOS implementations. The RCE has been designed in a 65 nm standard CMOS process, and SPICE simulations have been performed to evaluate its feasibility and functionality. In addition, a pulsed neural network employing an RCE crossbar array has been designed and simulated to verify the operation of the RCE.

Page 7: Program-at-a-Glance - AICAS 2019 › img › document › AICAS2019_Technical-Progra… · Emerging Memory Technologies for Neuromorphic Circuits and Systems Lecture session 8 Low

L2.2 09:50-10:10 CNNP-v2: An Energy-Efficient Memory-Centric Convolutional Neural Network Processor Architecture

Sungpill Choi*, Kyeongryeol Bong, Donghyeon Han, Hoi-Jun Yoo

KAIST, Korea

An energy-efficient memory-centric convolutional neural network (CNN) processor architecture is proposed for smart devices such as wearables or internet of things (IoT) devices. It achieves energy-efficient processing through two key features. First, 1-D shift-convolution PEs with a fully distributed memory architecture achieve 3.1 TOPS/W energy efficiency: even with 1024 massively parallel MAC units, the fully locally routed design allows the supply voltage to be scaled down to 0.46 V. Second, a fully configurable 2-D mesh core-to-core interconnect supports various input feature sizes to maximize utilization. The proposed architecture is evaluated on a 16 mm2 chip fabricated in a 65 nm CMOS process; it performs real-time face recognition with only 9.4 mW at 10 MHz and 0.48 V.

L2.3 10:10-10:30 An Energy-Efficient Accelerator with Relative-Indexing Memory for Sparse Compressed Convolutional Neural Network

I-Chen Wu1, Po-Tsang Huang*2, Chin-Yang Lo1, Wei Hwang1,2
1Department of Electronics Engineering, National Chiao Tung University, Taiwan; 2International College of Semiconductor Technology, National Chiao Tung University, Taiwan

Deep convolutional neural networks (CNNs) are widely used in image recognition and feature classification. However, deep CNNs are hard to deploy fully on edge devices because of their computation-intensive and memory-intensive workloads, and their energy efficiency is dominated by off-chip memory accesses and convolution computation. In this paper, an energy-efficient accelerator is proposed for sparse compressed CNNs that reduces DRAM accesses and eliminates zero-operand computation. Weight compression reduces the required memory capacity/bandwidth and removes a large portion of connections, while the ReLU function produces zero-valued activations. Additionally, the workloads are distributed over channels to increase the degree of task parallelism, and all-row-to-all-row non-zero element multiplication is adopted to skip redundant computation. Simulation results show that, relative to a dense accelerator, the proposed accelerator achieves a 1.79x speedup and reduces the on-chip memory size, energy, and DRAM accesses of VGG-16 by 23.51%, 69.53%, and 88.67%, respectively.
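A common way to store such sparse compressed weights is relative indexing, sketched below with a hypothetical 4-bit offset field; the accelerator's actual encoding may differ in field widths and layout.

```python
import numpy as np

# Each nonzero is stored as (offset from the previous nonzero, value); a
# dummy zero is inserted whenever the offset overflows the index field.
MAX_OFF = 15                                     # 4-bit relative offset

def encode(w):
    pairs, last = [], -1
    for i in np.flatnonzero(w):
        gap = i - last
        while gap > MAX_OFF:                     # dummy keeps offsets small
            pairs.append((MAX_OFF, 0.0))
            gap -= MAX_OFF
        pairs.append((gap, float(w[i])))
        last = i
    return pairs

def decode(pairs, n):
    w, pos = np.zeros(n), -1
    for off, val in pairs:
        pos += off
        w[pos] = val
    return w

rng = np.random.default_rng(0)
w = rng.standard_normal(256) * (rng.random(256) < 0.1)   # ~90% zeros
assert np.allclose(decode(encode(w), 256), w)
```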

L2.4 10:30-10:50 Accelerator Design for Vector Quantized Convolutional Neural Network

Yi-Heng Wu*, Heng Lee, Yu Sheng Lin, Shao-Yi Chien

National Taiwan University, Taiwan

In recent years, deep convolutional neural networks (CNNs) have achieved ground-breaking success in many computer vision research fields. Due to their large model size and tremendous computation, CNNs cannot be efficiently executed on small devices like mobile phones. Although several hardware accelerator architectures have been developed, most of them can only efficiently address one of the two major layer types in a CNN: convolutional (CONV) or fully connected (FC) layers. In this paper, based on algorithm-architecture co-exploration, our architecture targets executing both layer types with high efficiency. Vector quantization is first selected to compress the parameters, reduce the computation, and unify the behaviors of both CONV and FC layers. To fully exploit the gain of vector quantization, we then propose an accelerator architecture for quantized CNNs. Different DRAM access schemes are employed to reduce DRAM accesses, and a high-throughput processing element architecture accelerates the quantized layers. Compared to previous CNN accelerators, the proposed architecture achieves 1.2-5x less DRAM access and 1.5-5x higher throughput for both CONV and FC layers.
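The computational saving of vector quantization can be seen in a short sketch: once weight sub-vectors are codebook indices, a dot product becomes one table of partial sums per input, shared by every output neuron. All sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, SUB, CB = 64, 4, 16                     # dim, sub-vector len, codebook size
codebook = rng.standard_normal((CB, SUB))
idx = rng.integers(0, CB, size=(1000, D // SUB))   # 1000 quantized weight rows

x = rng.standard_normal(D)
# Precompute once per input: dot of every codeword with every x sub-vector.
lut = codebook @ x.reshape(D // SUB, SUB).T        # shape (CB, D//SUB)
y_vq = lut[idx, np.arange(D // SUB)].sum(axis=1)   # gather + add, no multiplies

w = codebook[idx].reshape(1000, D)                 # decompressed weights
assert np.allclose(y_vq, w @ x)
```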

SS02 Special Session 2 Edge and Fog Computing to Enable AI in IoT

Tuesday, March 19|11:10-12:30

Ballroom B, 10F

Chair(s):

Zhuo Zou, Fudan University, China

Tomi Westerlund, University of Turku, Finland

SS02.1 11:10-11:30 Edge and Fog Computing enabled AI for Internet of Things

Zhuo Zou*1, Yi Jin1, Paavo Nevalainen2, Yuxiang Huan1, Jukka Heikkonen2, Tomi Westerlund2
1Fudan University, China; 2University of Turku, Finland

In recent years, Artificial Intelligence (AI) has been widely deployed in a variety of business sectors and industries, yielding a number of revolutionary applications and services that are primarily driven by high-performance computation and storage facilities in the cloud. On the other hand, embedding intelligence into edge devices is highly demanded by emerging application areas such as autonomous systems, human-machine interaction, and the Internet of Things (IoT). In these applications, it is advantageous to process data near or at its source to improve energy and spectrum efficiency, strengthen security, and decrease latency. Although the computation capability of edge devices has increased tremendously during the past decade, it is still challenging to run sophisticated AI algorithms on these resource-constrained devices, which calls not only for low-power chips for energy-efficient processing at the edge, but also for a system-level framework to distribute resources and tasks along the edge-cloud continuum. In this overview, we summarize dedicated edge hardware for embedded learning, from mobile applications to sub-mW “always-on” IoT nodes. Recent advances in circuits and systems incorporating the joint design of architectures and algorithms are reviewed. The fog computing paradigm, which enables processing at the edge while still offering the possibility to interact with the cloud, is covered, with a focus on the opportunities and challenges of exploiting fog computing in AI as a bridge between edge devices and the cloud.

SS02.2 11:30-11:50 Survey of Precision-Scalable Multiply-Accumulate Units for Neural-Network Processing

Vincent Camus*1,2, Christian Enz1, Marian Verhelst2
1ICLAB, EPFL, Switzerland; 2ESAT-MICAS, KU Leuven, Belgium

The current trend in deep learning comes with an enormous computational need for billions of Multiply-Accumulate (MAC) operations per inference. Fortunately, reduced precision has demonstrated large benefits with low impact on accuracy, paving the way towards processing in mobile devices and IoT nodes. Precision-scalable MAC architectures optimized for neural networks have recently gained interest thanks to their subword-parallel or bit-serial capabilities. Yet, it has been hard to judge their relative benefits fairly, as they have been implemented with different technologies and performance targets. In this work, run-time configurable MAC units from ISSCC 2017 and 2018 are implemented and compared objectively under diverse precision scenarios. All circuits are synthesized in a 28 nm commercial CMOS process with precision ranging from 2 to 8 bits. This work analyzes the impact of scalability and compares the different MAC units in terms of energy, throughput, and area, aiming to identify the optimal architectures for reducing computation costs in neural-network processing.

SS02.3 11:50-12:10 Towards Workload-Balanced, Live Deep Learning Analytics for Confidentiality-Aware IoT Medical Platforms

Jose Granados*, Haoming Chu, Zhuo Zou, Lirong Zheng

Fudan University, China

Internet of Things (IoT) applications for healthcare are among the most studied topics in the research landscape, owing to the promise of more efficient resource allocation for hospitals and of a companion tool for health professionals. Yet the requirements in terms of low power, latency, and knowledge extraction from the large amount of generated physiological data represent a challenge to be addressed by the research community. In this work, we examine the balance between power consumption, performance, and latency among edge, gateway, fog, and cloud layers in an IoT medical platform featuring inference by deep learning models. We set up an IoT architecture to acquire and classify multichannel electrocardiogram (ECG) signals into normal or abnormal states, which could represent a clinically relevant condition, by combining custom embedded devices with contemporary open-source machine learning packages such as TensorFlow. Different hardware platforms are tested to find the best compromise in terms of convenience, latency, power consumption, and performance. Our experiments indicate that the real-time requirements are fulfilled; however, energy expenditure needs to be reduced by incorporating low-power SoCs with integrated neuromorphic blocks.

SS02.4 12:10-12:30 Artificial Intelligence of Things Wearable System for Cardiac Disease Detection

Yu-Jin Lin1, Chen-Wei Chuang1, Chun-Yueh Yen1, Sheng-Hsin Huang1, Peng-Wei Huang1, Ju-Yi Chen2, Shuenn-Yuh Lee*1
1Department of Electrical Engineering, National Cheng Kung University, Taiwan; 2Division of Cardiology, Department of Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Taiwan

This study proposes an artificial intelligence of things (AIoT) system for electrocardiogram (ECG) analysis and cardiac disease detection. The system includes front-end IoT-based hardware, a user interface in a smart-device application (APP), a cloud database, and an AI platform for cardiac disease detection. The front-end IoT-based hardware, a wearable ECG patch that includes an analog front-end circuit and a Bluetooth module, detects ECG signals. The APP on smart devices can not only display users' real-time ECG signals but also label unusual signals instantly, achieving real-time disease detection. The ECG signals are uploaded to the cloud database, which stores each user's ECG signals and forms a big-data database for the AI algorithm to detect cardiac disease. The proposed algorithm is based on a convolutional neural network, and its average accuracy is 94.96%. The ECG dataset applied in this study was collected from patients in Tainan Hospital, Ministry of Health and Welfare, and signal verification was also performed by a cardiologist.

L3 Lecture Session 3 Neuromorphic Processors

Tuesday, March 19|11:10-12:30

Ballroom C, 10F

Chair(s):

Chiara Bartolozzi, Istituto Italiano di Tecnologia, Italy

L3.1 11:10-11:30 Robust Learning and Recognition of Visual Patterns in Neuromorphic Electronic Agents

Dongchen Liang*, Raphaela Kreiser, Carsten Nielsen, Ning Qiao, Yulia Sandamirskaya, Giacomo Indiveri

University of Zurich and ETH Zurich, Switzerland

Mixed-signal analog/digital neuromorphic circuits are characterized by ultra-low power consumption, real-time processing abilities, and low-latency response times. These features make them promising for robotic applications that require fast and power-efficient computing. However, the unavoidable variability inherent in analog circuits makes it challenging to develop neural processing architectures that perform complex computations robustly. In this paper, we present a spiking neural network architecture with spike-based learning that enables robust learning and recognition of visual patterns on a noisy silicon neural substrate and in noisy environments. The architecture is used to perform pattern recognition and inference after a training phase with computers and neuromorphic hardware in the loop. We validate the proposed system in a closed-loop hardware setup composed of neuromorphic vision sensors and processors, and we present experimental results that quantify its real-time and robust perception and action behavior.

L3.2 11:30-11:50 DropOut and DropConnect for Reliable Neuromorphic Inference under Energy and Bandwidth Constraints in Network Connectivity

Yasufumi Sakai*1,2, Bruno Umbria Pedroni2, Siddharth Joshi3, Abraham Akinin2, Gert Cauwenberghs2
1Fujitsu Laboratories Ltd.; 2University of California, San Diego, La Jolla, USA; 3University of Notre Dame, Notre Dame, USA

DropOut and DropConnect are known as effective methods to improve the generalization performance of neural networks, by randomly dropping states of neural units or weights of synaptic connections, respectively, at each time instance throughout the training process. In this paper, we extend the use of these methods to the design of neuromorphic spiking neural network (SNN) hardware to further improve the reliability of inference as impacted by resource-constrained errors in network connectivity. Such energy and bandwidth constraints arise for low-power operation in the communication between neural units, causing dropped spike events due to timeout errors in transmission. The DropOut and DropConnect processes during training are aligned with a statistical model of the network during inference that accounts for these random errors in the transmission of neural states and synaptic connections. Using DropOut and DropConnect during training hence allows two design objectives to be met simultaneously: maximizing bandwidth while minimizing energy of inference in neuromorphic hardware. Simulations of the model with a 5-layer fully connected 784-500-500-500-10 SNN on the MNIST task show a 5-fold and a 10-fold improvement in bandwidth during inference at greater than 98% accuracy, using DropOut and DropConnect, respectively, during backpropagation training.
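The contrast between the two masking schemes is compact enough to show on one fully connected layer (sizes chosen to match the 784-500 input layer above); during SNN inference, the same kind of mask models randomly dropped spike events and connections.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((500, 784))
x = rng.standard_normal(784)
p = 0.5                                            # drop probability

# DropOut: zero whole activations (neuron states), with inverted scaling.
drop_out = (rng.random(784) >= p) / (1 - p)
y_dropout = W @ (x * drop_out)

# DropConnect: zero individual weights (synaptic connections).
drop_conn = (rng.random(W.shape) >= p) / (1 - p)
y_dropconnect = (W * drop_conn) @ x
```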

L3.3 11:50-12:10 Conversion of Synchronous Artificial Neural Network to Asynchronous Spiking Neural Network using sigma-delta quantization

Amirreza Yousefzadeh1, Sahar Hosseini2, Priscila Holanda1, Sam Leroux1, Thilo Werner1, Teresa Serrano-Gotarredona2, Bernabe Linares Barranco*2, Bart Dhoedt1, Pieter Simoens1
1Ghent University-imec, IDLab, Belgium; 2Instituto de Microelectronica de Sevilla (CSIC and Univ. de Sevilla), Sevilla, Spain

Artificial Neural Networks (ANNs) show great performance in several data analysis tasks, including visual and auditory applications. However, direct implementation of these algorithms without considering the sparsity of data requires high processing power, consumes vast amounts of energy, and suffers from scalability issues. Inspired by biology, one method that can reduce power consumption and allow scalability in the implementation of neural networks is asynchronous processing and communication by means of action potentials, so-called spikes. In this work, we use the well-known sigma-delta quantization method and introduce an easy and straightforward solution for converting an Artificial Neural Network into a Spiking Neural Network that can be implemented asynchronously on a neuromorphic platform. Briefly, we use asynchronous spikes to communicate the quantized output activations of the neurons. Despite the fact that our proposed mechanism is simple and applicable to a wide range of different ANNs, it outperforms state-of-the-art implementations in terms of accuracy and energy consumption. All source code for this project is available upon request for academic purposes.
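The core idea, that a neuron only emits signed spikes when its quantized activation changes, can be sketched in a few lines; the quantization step and event format below are assumptions for illustration.

```python
import numpy as np

def sigma_delta_encode(activations, q=0.1):
    """Return (time step, signed number of quanta) events for one neuron."""
    events, prev = [], 0
    for t, a in enumerate(activations):
        level = round(a / q)              # quantized activation level
        if level != prev:
            events.append((t, level - prev))
            prev = level
    return events

a = 0.5 + 0.3 * np.sin(np.linspace(0, 2 * np.pi, 100))
ev = sigma_delta_encode(a)
# A slowly varying signal costs few events, and the sparse events still
# reconstruct the quantized signal exactly:
assert sum(d for _, d in ev) == round(a[-1] / 0.1)
```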

L3.4 12:10-12:30 Neuromorphic networks using silicon retina on the SpiNNaker platform

Germain Haessig*1,2, Francesco Galluppi2, Xavier Lagorce2, Ryad Benosman2,3
1Institute of Neuroinformatics, University of Zurich and ETH Zurich, Switzerland; 2Institut de la Vision, Sorbonne Universite, France; 3University of Pittsburgh, Medical Center, USA

This paper describes spike-based neural networks for optical flow and stereo estimation from Dynamic Vision Sensor data. These methods combine the Asynchronous Time-based Image Sensor with the SpiNNaker platform. The sensor generates spikes with sub-millisecond resolution in response to scene illumination changes. These spikes are processed by a spiking neural network running on SpiNNaker with a 1-millisecond resolution to accurately determine the order and time difference of spikes from neighboring pixels, and therefore infer velocity, direction, or depth. The spiking neural networks are variants of the Barlow-Levick method for optical flow estimation and of the Marr & Poggio method for stereo matching.

L4 Lecture Session 4 Application Specific AI Accelerators

Tuesday, March 19|11:10-12:30

Ballroom D, 11F

Chair(s):

Chia-Hsiang Yang, National Taiwan University, Taiwan

Tobi Delbruck, Inst. of Neuroinformatics, UZH & ETH Zurich, Switzerland

L4.1 11:10-11:30 A Flexible and High-Performance Self-Organizing Feature Map Training Acceleration Circuit and Its Applications

Yu-Hsiu Sun, Tzi-Dar Chiueh*

National Taiwan University, Taiwan

The self-organizing feature map (SOFM) is a type of artificial neural network based on an unsupervised learning algorithm. In this work, we present a circuit for accelerating SOFM training, which forms the foundation of an effective, efficient, and flexible SOFM training platform for different network geometries, including array, rectangular, and binary tree. FPGA validation was conducted to examine the speedup of this circuit compared with software training. In addition, we applied our design to three applications: chromaticity diagram learning, MNIST handwritten numeral auto-labeling, and image vector quantization. All three experiments show that the proposed circuit architecture indeed provides a high-performance and cost-effective solution for SOFM training.
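For reference, one SOFM training step on a rectangular grid is sketched below: find the best-matching unit (BMU), then pull it and its neighbors toward the input. This is the inner loop such an accelerator speeds up; the grid size and learning schedule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3
W = rng.random((grid_h, grid_w, dim))            # neuron weight vectors
gy, gx = np.mgrid[0:grid_h, 0:grid_w]

def train_step(x, lr=0.1, sigma=1.5):
    d = np.linalg.norm(W - x, axis=2)            # distance to every neuron
    by, bx = np.unravel_index(np.argmin(d), d.shape)   # BMU coordinates
    g = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
    W += lr * g[..., None] * (x - W)             # neighborhood update

for _ in range(1000):
    train_step(rng.random(3))                    # e.g., learn RGB topology
```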

L4.2 11:30-11:50 A 2.17mW Acoustic DSP Processor with CNN-FFT Accelerators for Intelligent Hearing Aided Devices

Yu-Chi Lee1, Tai-Shih Chi2, Chia-Hsiang Yang*1,3
1Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan; 2National Chiao Tung University, Taiwan; 3Department of Electrical Engineering, National Taiwan University, Taiwan

This paper proposes an acoustic DSP processor with a neural network core for speech enhancement. Accelerators for convolutional neural networks (CNNs) and the fast Fourier transform (FFT) are embedded. The CNN-based speech enhancement algorithm takes the speech signal's spectrogram as input and predicts the desired speech mask to enhance intelligibility. An array of multiply-accumulate (MAC) and coordinate rotation digital computer (CORDIC) engines is deployed to efficiently compute linear and nonlinear functions. Hardware sharing reduces area by leveraging the high similarity between CNN and FFT computations. The proposed DSP processor chip is fabricated in a 40 nm CMOS technology with a core area of 4.3 mm^2 and dissipates 2.17 mW at an operating frequency of 5 MHz. The CNN accelerator supports both convolutional and fully-connected layers and achieves an energy efficiency of 1200 to 2180 GOPS/W despite the added flexibility for FFT. Speech intelligibility can be enhanced by up to 41% under low-SNR conditions.
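For readers unfamiliar with CORDIC, the rotation-mode iteration that such engines implement is shown below in floating point; a hardware version would use fixed-point shifts and a precomputed gain constant.

```python
import math

def cordic_sincos(theta, n_iters=32):
    """Compute (cos(theta), sin(theta)) with CORDIC in rotation mode.
    theta must lie in the CORDIC convergence range (|theta| < ~1.7433 rad)."""
    angles = [math.atan(2.0 ** -i) for i in range(n_iters)]
    K = 1.0
    for i in range(n_iters):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))   # aggregate gain
    x, y, z = 1.0, 0.0, theta
    for i, a in enumerate(angles):
        d = 1.0 if z >= 0 else -1.0        # rotate toward the residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * a
    return x * K, y * K

c, s = cordic_sincos(0.5)
assert abs(c - math.cos(0.5)) < 1e-6 and abs(s - math.sin(0.5)) < 1e-6
```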

L4.3 11:50-12:10 A Customized Convolutional Neural Network Design Using Improved Softmax Layer for Real-time Human Emotion Recognition

Kai-Yen Wang*, Yu-De Huang, Yun-Lung Ho, Nicolas Fahier, Wai-Chi Fang

National Chiao Tung University, Taiwan

This paper proposes an improved Softmax layer algorithm and hardware implementation, applicable to an effective convolutional neural network for EEG-based real-time human emotion recognition. Compared with the general Softmax layer, this hardware design adds threshold layers to accelerate training and replaces Euler's number with a dynamic base value to improve network accuracy. This work also shows a hardware-friendly way to implement the batch normalization layer on chip. Using the EEG emotion DEAP [7] database, maximum and mean classification accuracies of 96.03% and 83.88%, respectively, were achieved. The improved Softmax layer saves up to 15% of the training-model convergence time and also increases the average accuracy by 3 to 5%.
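The base generalization behind the "dynamic base value" can be sketched as follows; how the base is chosen dynamically during training is the paper's contribution and is not reproduced here.

```python
import numpy as np

def softmax_base(x, base=np.e):
    """Softmax generalized to an arbitrary base b > 1: b**x / sum(b**x).
    Equivalent to a standard softmax with logits scaled by ln(b)."""
    z = x * np.log(base)
    z -= z.max()                  # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])
print(softmax_base(logits, base=2.0))   # flatter than the base-e softmax
```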

L4.4 12:10-12:30 Context-Preserving Filter Reorganization for VDSR-Based Super-resolution

Donghyeon Lee1, Sangheon Lee1, Ho Seong Lee1, Kyujoong Lee*2, Hyuk-Jae Lee1
1Seoul National University, Korea; 2Sunmoon University, Korea

This paper presents a hardware design to process a CNN for single-image super-resolution (SISR). The very deep convolutional network for image super-resolution (VDSR) is a promising algorithm for SISR, but it is too complex to implement in hardware for commercial products. The proposed design aims to implement VDSR with relatively small hardware resources while minimizing the degradation of image quality. To this end, a 1D reorganization of the convolution filters is proposed to reduce the number of multipliers, and the 1D vertical filter is modified to reduce the internal SRAM that stores the input feature map. For an implementation with a reasonable hardware cost, the numbers of layers and channels per layer, as well as the parameter resolution, are decreased without a significant reduction of image quality, as observed from simulation results. The 1D reorganization reduces the number of multipliers to 55.6%, while the size reduction of the 1D vertical filter halves the buffer size. As a result, the proposed design processes full-HD video in real time with 8,143.5k gates and 333.1 kB of SRAM, while image quality is degraded by 1.06 dB compared with VDSR.
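One generic way to reorganize a 2D convolution filter into 1D filters (the paper's exact reorganization is not detailed in the abstract) is a separable, rank-1 factorization, which directly cuts the per-pixel multiply count:

```python
import numpy as np

# Hypothetical 3x3 kernel; chosen rank-1 so the factorization is exact.
K = np.outer([1., 2., 1.], [1., 0., -1.])   # Sobel-like kernel

U, S, Vt = np.linalg.svd(K)
v = U[:, 0] * np.sqrt(S[0])     # 3x1 vertical filter
h = Vt[0, :] * np.sqrt(S[0])    # 1x3 horizontal filter
assert np.allclose(np.outer(v, h), K)       # exact because rank(K) == 1

# Per output pixel: 9 multiplies for the 2D kernel vs 3 + 3 = 6 for the
# 1D pair; higher-rank kernels need a truncated (approximate) expansion.
```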

Panel Discussion

AI Computing for Smart Life: What, Why, Who, and Where

Tuesday, March 19|13:30-15:00

Ballroom B, 10F

Chair(s): David Brooks, Harvard University, USA

Abstract: Artificial Intelligence has seen fast-growing applications in our daily life, including data analysis, healthcare, autonomous driving, life sciences, digital transformation, FinTech, security (cybersecurity and others), IoT, next-generation smart building technologies, robotics, various consumer applications, and much more. On top of that, AI computing devices, from servers to the edge, are being developed and deployed at a fast pace. Seeing the great opportunities of AI computing, this panel aims at exchanging ideas between the panelists and the audience about the four Ws of emerging AI computing for smart life: what, why, who, and where.

Panelists:

Andrew Kahng (ACM & IEEE Fellow, Professor, UCSD)

Yen-Kuang Chen (IEEE Fellow, Alibaba USA, formerly a Principal Research Scientist of Intel)

Chun-Chen Liu (CEO, Kneron)

Tihao Chiang (IEEE Fellow, VP, Ambarella)

WICAS & YP (Industry Forum for YP & WIE)

Influences of EDGE Device’s Instant Decision: From Bio-Tech, FinTech to Sustainable Energy & Beyond

Tuesday, March 19|15:00-16:00

Ballroom B, 10F

Chair(s):

Chris Gwo Giun Lee, National Cheng Kung University, Taiwan

Co-organized as Industry Forum with IEEE Region 10 Industry Relations Committee

Abstract: In view of aging societies, the significance of the digital economy, and global warming, this Industry Forum is organized together with Young Professionals (YP) and Women in Engineering (WIE) within the IEEE Circuits and Systems Society. Industry leaders, investors, and venture capitalists are invited to share their visions on how the current state of the art in EDGE/mobile devices, capable of near real-time decisions, may influence our daily life from the perspectives of healthcare, finance, and energy. After addressing the pain points of these industry sectors, internship and potential industry/academia collaboration opportunities within the current Technology 4.0 entrepreneurial landscape will also be explored.

15:00~15:05 Overview

Moderator: Chris Gwo Giun Lee, National Cheng Kung University, Taiwan

15:05~15:20

Artificial intelligence in Medicare: benefits and opportunities

Jessie Yu-Shin Wang

Biomedical Technology and Device Research Laboratories, ITRI

Abstract: Artificial intelligence is the hottest buzzword at recent medical and healthcare expos and exhibitions, as more than half of global leaders expect an expansion of AI in monitoring and diagnosis equipment. Whether it is computer-assisted technology, decision-making processes, robotics, or manufacturing automation, the new technology is changing our lifestyle and accelerating the product development process. In this presentation, the mHealth solution powered by new-generation AI developed at ITRI is introduced. We also investigate mHealth product marketing from the regulatory perspective and explore the possibilities of implementing an R&D roadmap in medicare.

Topics: Mobile health in “making healthcare more accessible” via “Artificial Intelligence (AI) as an extension of telemedicine”: how medical expertise and experience could be extended to reach remote or home-care facilities via AI. Internship opportunities and potential academic collaborations with ITRI will be discussed.

15:20~15:35 What banks are doing in Fintech

Michelle Wang

Senior Vice President, Head of Big Data Intelligence Center, Taipei Fubon Commercial Bank

Honorable Guests: EVP Sheila Chuang & Digital Banking Advisor Wei-Bin Lee

Topics: Sharing how Fubon's mobile APP software may provide more customized services for personalized finance management with immediacy. Experience sharing on leadership in enterprises, as well as Fubon's internship and/or potential collaboration opportunities with academia, will also be given.

15:35~15:50 Rapid Evolution of Disruptive Digital Transformation to Society 5.0: Global Trend on Impact of Fintech by AI and Blockchain Technology

Andy Chen

President & CEO of Catronic Enterprise, Founding Managing Partner, REDDS Capital

Topics: Global vision sharing on Fintech and the current entrepreneurial landscape in linking distributed energy data via blockchain, and the corresponding cybersecurity.

15:50~16:00 Q&A

SS03 Special Session 3 Analytics Algorithm/Architecture for Smart System Design

Tuesday, March 19|16:20-18:00

Ballroom B, 10F

Chair(s):

Shuvra S. Bhattacharyya, University of Maryland, USA

SS03.1 16:20-16:40 A Framework for Design and Implementation of Adaptive Digital Predistortion Systems

Lin Li*1, Peter Deaville1, Lauri Anttila2, Mikko Valkama2, Adrian Sapio1, Marilyn Wolf3, Shuvra Bhattacharyya1,2
1University of Maryland, College Park, USA; 2Tampere University, Finland; 3Georgia Institute of Technology, USA

Digital predistortion (DPD) has important applications in wireless communication for smart systems, for example in Internet of Things (IoT) applications for smart cities. DPD is used in wireless transmitters to counteract distortions that arise from nonlinearities, such as those related to amplifier characteristics and local oscillator leakage. In this paper, we propose an algorithm-architecture-integrated framework for the design and implementation of adaptive DPD systems. The proposed framework provides energy-efficient, real-time DPD performance, and enables efficient reconfiguration of DPD architectures so that communication can be dynamically optimized based on time-varying requirements. Our adaptive DPD design framework applies Markov Decision Processes (MDPs) in novel ways to generate optimized runtime control policies for DPD systems. We present a GPU-based adaptive DPD system derived using our design framework and demonstrate its efficiency through extensive experiments.
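For intuition about what a predistorter does, here is a toy least-squares sketch that linearizes a hypothetical memoryless cubic amplifier model; the paper's adaptive, MDP-controlled framework is far more general than this static fit.

```python
import numpy as np

def pa(x):
    return x - 0.10 * x**3            # hypothetical amplifier compression

# Fit an odd polynomial predistorter p so that pa(p(d)) ~ d, by sampling
# the inverse mapping (pa output -> drive level) and solving least squares.
x_drive = np.linspace(-1, 1, 1001)
u = pa(x_drive)                        # amplifier output over the drive range
A = np.stack([u, u**3, u**5], axis=1)
c, *_ = np.linalg.lstsq(A, x_drive, rcond=None)

def predistort(d):
    return c[0] * d + c[1] * d**3 + c[2] * d**5

d_test = np.linspace(-0.85, 0.85, 500)          # stay inside the fitted range
err = np.max(np.abs(pa(predistort(d_test)) - d_test))
print(f"max linearization error: {err:.4f}")    # small residual (~1e-3)
```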

SS03.2 16:40-17:00 Reconfigurable Edge via Analytics Architecture

Shih-Yu Chen*1, Gwo Giun (Chris) Lee1, Tai-Ping Wang2, Chin-Wei Huang1, Jia-Hong Chen1, Chang-Ling Tsai3
1National Cheng Kung University, Taiwan; 2ASE Group Inc., Taiwan; 3University of Washington, USA

As artificial intelligence (AI) algorithms requiring high accuracy become increasingly complex, and Edge/IoT-generated data becomes increasingly large, flexible reconfigurable processing is crucial in the design of efficient, low-power smart edge systems, and is introduced in this paper. In AI, analytics algorithms are typically used to analyze speech, audio, image, and video data. In our cross-level system design methodology, different algorithmic realizations are analyzed in the form of dataflow graphs (DFGs) to further increase efficiency and flexibility in constituting an “analytics architecture”. Capturing both algorithmic behavior and architectural information, including software and hardware, the DFG provides a mathematical representation which, as opposed to traditional linear difference equations, better models the underlying computational platform for systematic analysis, providing flexible and efficient management of computational and storage resources. In our analytics architecture work, parallel and reconfigurable computing are formulated via DFGs, analogous to the analysis and synthesis equations of the well-known Fourier transform pair. In parallel computing, a connected component is eigen-decomposed into unconnected components for concurrent processing. To save computation resources, commonalities in DFGs are analyzed for reuse when synthesizing or reconfiguring the edge platform. In this paper, we specifically introduce a lightweight edge platform upon which the convolutions of a Convolutional Neural Network are eigen-transformed into matrix operations with higher symmetry, facilitating fewer operations, lower data transfer rates, and less storage, and anticipating lower power when synthesizing or reconfiguring the eigenvectors.
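The paper's eigen-transformation itself is not detailed in the abstract; as a generic illustration of recasting convolution as matrix operations, here is the standard im2col reformulation:

```python
import numpy as np

def im2col_conv2d(x, k):
    """Valid 2-D correlation computed as one im2col matrix product."""
    H, W = x.shape
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return (cols @ k.ravel()).reshape(oh, ow)

x = np.arange(25.0).reshape(5, 5)
k = np.ones((3, 3)) / 9.0
ref = np.array([[x[i:i+3, j:j+3].mean() for j in range(3)] for i in range(3)])
assert np.allclose(im2col_conv2d(x, k), ref)
```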

SS03.3 17:00-17:20 Improved Hybrid Memory Cube for Weight-Sharing Deep Convolutional Neural Networks

Hao Zhang, Jiongrui He, Seok-Bum Ko*

University of Saskatchewan, Canada

In recent years, many deep neural network accelerator architectures have been proposed to improve the performance of processing deep neural network models. However, memory bandwidth remains the major issue and performance bottleneck of these accelerators. Emerging 3D memory, such as the hybrid memory cube (HMC), and processing-in-memory techniques provide new solutions for deep neural network implementation. In this paper, a novel HMC architecture is proposed for weight-sharing deep convolutional neural networks to relieve the memory bandwidth bottleneck. The proposed HMC is based on the conventional HMC architecture with only minor changes: in the logic layer, the vault controller is modified to enable parallel vault access. The weight parameters of the pre-trained convolutional neural network are quantized to 16 values. During processing, the activations that share a weight are accumulated first, and only the accumulated results are transferred to the processing elements to be multiplied by the weights. With this architecture, the data transfer between main memory and processing elements is reduced, and the throughput of convolution operations is improved by 30% compared to an HMC-based multiply-accumulate design.
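The arithmetic saving from weight sharing is easy to see in a sketch (sizes illustrative): accumulating activations per shared weight first turns thousands of multiplies into one multiply per codebook entry.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.linspace(-1, 1, 16)                 # 16 quantized weight values
idx = rng.integers(0, 16, size=4096)              # weight -> codebook index
w = codebook[idx]
a = rng.standard_normal(4096)                     # activations

direct = a @ w                                    # 4096 multiplies
bucket = np.zeros(16)
np.add.at(bucket, idx, a)                         # accumulate per shared weight
shared = bucket @ codebook                        # only 16 multiplies
assert np.isclose(direct, shared)
```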

SS03.4 17:20-17:40 Function-Safe Vehicle AI Processor with Nano Core-in-Memory Architecture

Youngsu Kwon*, Jeongmin Yang, Yongcheol Peter Cho, Kyoung-Seon Shin, Jaehoon Chung, Jinho Han, Chun-Gi Lyuh, Hyun-Mi Kim, Chan Kim, Min-Seok Choi

AI Processor Research Group, Electronics and Telecommunications Research Institute, Korea

State-of-the-art neural network accelerators consist of arithmetic engines organized in a mesh-structured datapath surrounded by memory blocks that feed neural data into the datapath. While server-based accelerators coupled with server-class processors can afford large silicon area and power budgets, electronic control units in autonomous driving vehicles require power-optimized 'AI processors' with a small footprint. An AI processor for mobile applications that integrates general-purpose processor cores with mesh-structured neural network accelerators and high-speed memory, while achieving high performance under low-power and compact-area constraints, necessitates a novel AI processor architecture. We present the design of an AI processor for electronic systems in autonomous driving vehicles, targeting not only CNN-based object recognition but also MLP-based in-vehicle voice recognition. The AI processor integrates Super-Thread-Cores (STCs) for neural network acceleration with function-safe general-purpose cores that satisfy vehicular electronics safety requirements. The STC is composed of 16384 programmable nano-cores organized in a mesh-grid-structured datapath network. Designed based on a thorough analysis of neural network computations, the nano-core-in-memory architecture enhances the computation intensity of the STC by efficiently feeding multi-dimensional activation and kernel data into the nano-cores. The quad function-safe general-purpose cores ensure the functional safety of the Super-Thread-Core to comply with the road vehicle safety standard ISO 26262. The AI processor exhibits 32 tera-FLOPS, enabling hyper-real-time execution of CNNs, RNNs, and FCNs.

SS03.5 17:40-18:00 Fast Detection of Objects Using a YOLOv3 Network for a Vending Machine

Youhak Lee*1, Chulhee Lee1, Jinsung Kim2, Hyuk-Jae Lee1
1Seoul National University, Korea; 2Sunmoon University, Korea

Fast object detection is important for a vision-based automated vending machine. This paper proposes a new scheme to enhance the operation speed of YOLOv3 by removing the computation for the region of non-interest. To avoid the accuracy drop caused by this removal, the characteristics of the convolutional and YOLO layers are investigated, and a new processing method is derived from experimental results. As a result, the operation speed increases in proportion to the size of the region of non-interest. Experimental results show that the speed is improved by 3.29 times while the accuracy degradation is 2.81% in mAP-50.

L5 Lecture Session 5 Deep Learning for Speech and Low-dimensional Signal Processing

Tuesday, March 19|16:20-18:00

Ballroom C, 10F

Chair(s):

Yin-Tsung Hwang, National Chung Hsing University, Taiwan

L5.1 16:20-16:40 Hyperdimensional Computing-based Multimodality Emotion Recognition with Physiological Signals

En-Jui Chang*1, Abbas Rahimi1, Luca Benini2, An-Yeu (Andy) Wu3
1Integrated System Laboratory, ETH Zurich, Switzerland; 2University of Bologna, Italy; 3National Taiwan University, Taiwan

To interact naturally and achieve mutual sympathy between humans and machines, emotion recognition is one of the most important functions for realizing advanced human-computer interaction devices. Due to the high correlation between emotion and involuntary physiological changes, physiological signals are a prime candidate for emotion analysis. However, because a huge amount of training data is needed for a high-quality machine learning model, computational complexity becomes a major bottleneck. To overcome this issue, brain-inspired hyperdimensional (HD) computing, an energy-efficient and fast-learning computational paradigm, has high potential to balance accuracy against the amount of necessary training data. We propose an HD Computing-based Multimodality Emotion Recognition (HDC-MER) scheme. HDC-MER maps real-valued features to binary HD vectors using a random nonlinear function, further encodes them over time, and fuses them across different modalities including GSR, ECG, and EEG. The experimental results show that, compared to the best method using the full training data, HDC-MER achieves higher classification accuracy for both valence (83.2% vs. 80.1%) and arousal (70.1% vs. 68.4%) using only 1/4 of the training data. HDC-MER also achieves at least 5% higher average accuracy than all other methods at any point along the learning curve.
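The basic HD operations, binding channel identities to feature levels and bundling the results, can be sketched minimally as follows; the dimensions, the three-channel layout, and the level quantization are illustrative stand-ins, not the HDC-MER pipeline.

```python
import numpy as np

D = 10_000
rng = np.random.default_rng(0)
hv = lambda: rng.choice([-1, 1], size=D).astype(np.int8)

channels = [hv() for _ in range(3)]               # e.g., GSR, ECG, EEG IDs
levels = [hv() for _ in range(4)]                 # 4 feature quantization bins

def encode(features):                             # features: 3 floats in [0, 1)
    bound = [channels[i] * levels[int(f * 4)]     # bind: elementwise product
             for i, f in enumerate(features)]
    return np.sign(np.sum(bound, axis=0))         # bundle: majority (odd count)

# A class prototype bundles many training encodings; a query is classified
# by normalized dot-product similarity to each prototype.
proto = np.sign(sum(encode(rng.random(3)) for _ in range(20)))
query = encode(rng.random(3))
print((proto * query).sum() / D)                  # similarity in [-1, 1]
```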

L5.2 16:40-17:00 Design of Intelligent EEG System for Human Emotion Recognition with Convolutional Neural Network

Kai-Yen Wang*, Yun-Lung Ho, Yu-De Huang, Nicolas Fahier, Wai-Chi Fang

National Chiao Tung University, Taiwan

Emotions play a significant role in the field of affective computing and Human-Computer Interfaces (HCI). In this paper, we propose an intelligent human emotion detection system based on EEG features with multi-channel fused processing. We also propose an advanced convolutional neural network implemented as a VLSI hardware design. This hardware design can accelerate both the training and classification processes and meet real-time system requirements for fast emotion detection. The performance of this design was validated using the DEAP [1] database with datasets from 32 subjects; the mean classification accuracy achieved is 83.88%.

L5.3 17:00-17:20 Sparse Autoencoder with Attention Mechanism for Speech Emotion Recognition

Ting-Wei Sun, An-Yeu (Andy) Wu

National Taiwan University, Taiwan

There has been a lot of previous work on speech emotion recognition with machine learning methods. However, most of it relies on the availability of labelled speech data. In this paper, we propose a novel algorithm that combines a sparse autoencoder with an attention mechanism. The aim is to benefit from both labeled and unlabeled data through the autoencoder, and to apply the attention mechanism to focus on speech frames that carry strong emotional information while ignoring frames that do not. The proposed algorithm is evaluated on three public databases in a cross-language setting. Experimental results show that the proposed algorithm provides significantly more accurate predictions than existing speech emotion recognition algorithms.
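
A minimal sketch of the frame-weighting idea reads as follows; the encoder features and the attention parameter `w` are placeholders (the real model learns `w` jointly with the autoencoder):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(frame_feats, w):
    """Weight speech frames by attention so frames carrying strong
    emotional content dominate the utterance-level embedding.
    frame_feats: (T, d) per-frame encoder outputs; w: (d,) learned
    attention vector (hypothetical)."""
    scores = frame_feats @ w      # one relevance score per frame
    alpha = softmax(scores)       # low-scoring frames are ~ignored
    return alpha @ frame_feats    # (d,) weighted utterance vector
```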

L5.4 17:20-17:40 A Pruned-CELP Speech Codec Using Denoising Autoencoder with Spectral Compensation for Quality and Intelligibility Enhancement

Yu-Ting Lo1, Syu-siang Wang2, Yu Tsao2, Sheng-Yu Peng1* 1National Taiwan University of Science and Technology, Taiwan 2Academia Sinica, Taiwan

A codec based on the code-excited linear prediction (CELP) speech compression method, adopting a denoising autoencoder with spectral compensation (DAE-SC) for quality and intelligibility enhancement, is proposed in this paper. The sizes of the CELP parameters in the encoder are carefully pruned to achieve a higher compression rate. To recover the speech quality and intelligibility degradation due to the pruned CELP parameters, a DAE-SC network with three hidden layers is employed in the decoder. Compared with the conventional CELP codec at a 9.6 kbps transmission rate, the proposed speech codec achieves an extra 21.9% bit-rate reduction with comparable speech quality and intelligibility, as evaluated by four commonly used speech performance metrics.

L5.5 17:40-18:00 An Enhanced MUSIC DoA Scanning Scheme for Array Radar Sensing in Autonomous Movers

Kuang-Ying Chang, Kuan-Ting Chen, Wei-Hsuan Ma*, Yin-Tsung Hwang

National Chung Hsing University, Taiwan

In this paper, we present an enhanced MUltiple SIgnal Classification (MUSIC) scheme for Direction of Arrival (DoA) scanning using a linear antenna array system. The goal is to construct an obstruction map based on the DoA scanning results for an autonomous mover navigating in a pedestrian-rich environment. A low-complexity DoA estimation scheme is developed that eliminates the computationally expensive Eigen Decomposition (ED) required by the conventional MUSIC algorithm, using an Orthogonal Projection Matrix (OPM) scheme instead. Furthermore, a QR decomposition method is employed to implement the pseudo-inverse matrix calculation required by the OPM scheme. This leads to a very computation-efficient approach and facilitates real-time implementation in hardware accelerators. The simulation results show that the proposed scheme performs comparably to the conventional scheme at a much lower computational complexity.
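
A rough sketch of an ED-free MUSIC-style scan, with the noise-space projector built by QR decomposition rather than eigendecomposition, is given below; taking the first K covariance columns as the signal-subspace basis is a simplifying assumption standing in for the paper's OPM construction:

```python
import numpy as np

def doa_spectrum(X, K, thetas, d_over_lambda=0.5):
    """ED-free MUSIC-like DoA scan for a uniform linear array.
    X: (M, N) array snapshots; K: assumed number of sources.
    A QR decomposition of K covariance columns spans the signal
    subspace, so no eigendecomposition is needed."""
    M, N = X.shape
    R = X @ X.conj().T / N                  # sample covariance
    Q, _ = np.linalg.qr(R[:, :K])           # orthonormal signal basis
    P = np.eye(M) - Q @ Q.conj().T          # projector onto noise space
    n = np.arange(M)
    spec = []
    for th in thetas:
        a = np.exp(-2j * np.pi * d_over_lambda * n * np.sin(th))
        spec.append(1.0 / np.real(a.conj() @ P @ a))
    return np.array(spec)                   # peaks indicate DoAs
```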

KN4 Keynote 4

Wednesday, March 20|08:30-09:30

Ballroom B, 10F

Chair(s):

Shao-Yi Chien, National Taiwan University, Taiwan

Edge Intelligence for Optimized Systems & High-Performance Devices

Anthony Vetro

MERL, USA

SF Special Session/Forum 2018 Low-Power Image Recognition Challenge and Beyond

Wednesday, March 20|09:30-10:30

Ballroom B, 10F

Chair(s):

Yen-Kuang Chen, Alibaba USA, formerly a Principal Research Scientist of Intel

SF.1 09:30-09:50 2018 Low-Power Image Recognition Challenge and Beyond

Matthew Ardi1, Alexander Berg2, Bo Chen3, Yen-Kuang Chen4, Yiran Chen5, Donghyun Kang6, Junhyeok Lee8, Seungjae Lee9, Yang Lu7, Yung-Hsiang Lu*1, Fei Sun7 1Purdue University; 2University of North Carolina, USA; 3Google; 4Intel, USA; 5Duke University; 6Seoul National University; 7Facebook; 8KPST; 9ETRI

The IEEE Low-Power Image Recognition Challenge (LPIRC) is an annual competition started in 2015. The competition identifies the best technologies that can detect objects in images efficiently (short execution time and low energy consumption). This paper summarizes LPIRC in 2018 by describing the winners' solutions. The paper also discusses the future of low-power computer vision.

SF.2 09:50-10:10 (Invited Talk) Efficient Object Detection for LPIRC

Seungjae Lee1, Junhyeok Lee2 1ETRI; 2KPST

SF.3 10:10-10:30 (Invited Talk) Software Optimization-aware Network Selection for Image Recognition on the NVIDIA Jetson TX2 Board

Donghyun Kang

Seoul National University

L6 Lecture Session 6 Medical AI (I)

Wednesday, March 20|09:30-10:30

Ballroom C, 10F

Chair(s):

Yuan-Hao Huang, National Tsing Hua University, Taiwan

L6.1 09:30-09:50 Novel Sleep Apnea Detection Based on UWB Artificial Intelligence Mattress

Chiapin Wang*1,4, Jen-Hau Chan1, Shih-Hau Fang2,4, Ho-Ti Cheng2, Yeh-Liang Hsu3 1National Taiwan Normal University, Taiwan 2Department of Electrical Engineering, Yuan Ze University, Taiwan 3Department of Mechanical Engineering and Gerontechnology Research Center, Yuan Ze University, Taiwan 4MOST Joint Research Center for AI Technology and All Vista Healthcare, Taiwan

In this paper, we propose a novel sleep apnea identification system adopting a sleep breathing monitoring mattress that utilizes the ultra-wideband (UWB) physiological sensing technique. Unlike traditional methods, which need wearable devices and electrical equipment connected to patients, the proposed system detects apnea in an unobtrusive, non-contact way by using UWB sensors. The proposed system is built with a machine learning technique in the offline stage and detects apnea in the online stage using our designed apnea detection algorithm. The experimental results illustrate that the proposed apnea identification system efficiently detects sleep apnea without requiring diagnosis at a hospital.

L6.2 09:50-10:10 Machine Learning Based Sleep-Status Discrimination Using a Motion Sensing Mattress

Chiapin Wang*1,4, Tsung-Yi Fan Chian1, Shih-Hau Fan2,4, Chieh-Ju Li3, Yeh-Liang Hsu3 1National Taiwan Normal University, Taiwan 2Department of Electrical Engineering, Yuan Ze University, Taiwan 3Department of Mechanical Engineering and Gerontechnology Research Center, Yuan Ze University, Taiwan 4MOST Joint Research Center for AI Technology and All Vista Healthcare, Taiwan

This paper presents a novel sleep-status discrimination system adopting a motion sensing mattress that detects the user's activities in bed, including movements of the head, chest, legs, and feet. Unlike traditional methods such as polysomnography (PSG), which needs electrical equipment connected to users, or wrist actigraphy, which must be in contact with the user, the proposed system distinguishes sleep states in an unobtrusive, non-contact way. The proposed system is built with a machine learning technique in the offline stage and distinguishes sleep states in the online stage using our designed sleep-status discrimination algorithm. The experimental results illustrate that the proposed method efficiently distinguishes sleep statuses without a wearable device in contact with the body or PSG diagnosis undertaken at a hospital.

L6.3 10:10-10:30 Epilepsy Identification System with Neural Network Hardware Implementation

Chieh Tsou, Chi-Chung Liao, Shuenn-Yuh Lee*

National Cheng-Kung University, Taiwan

This paper presents a real-time identification system for epilepsy detection with a neural network (NN) classifier. The identification flow of the proposed system in animal testing is as follows: 1. Two-channel signals are collected from the mouse brain. 2. The original signals are filtered to the appropriate bandwidth. 3. Six feature values are calculated. 4. Normal and epileptic states are distinguished by the classifier. The electroencephalography signal is measured from C57BL/6 mice in animal testing with a sampling rate of 400 Hz. The proposed system is verified in both software design and hardware implementation. The software is designed in Matlab, and the hardware is implemented on a field programmable gate array (FPGA) platform. The chip is fabricated with TSMC 0.18 μm CMOS technology. The feature extraction function is realized in the FPGA, and the NN architecture is implemented as a chip. The chosen features from the previously measured animal testing data are amplitude, frequency bins, approximate entropy, and standard deviation. The accuracies of the proposed system are approximately 98.76% and 89.88% in software verification and hardware implementation, respectively. The results reveal that the proposed architecture is effective for epilepsy recognition.
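
For reference, a minimal sketch of the listed feature family (amplitude, frequency bins, approximate entropy, standard deviation) on a 400 Hz EEG window might look as follows; the band edges and the exact composition of the paper's six features are assumptions:

```python
import numpy as np

def band_powers(x, fs=400, bands=((1, 4), (4, 8), (8, 13), (13, 30))):
    """Frequency-bin features: spectral power in a few EEG bands
    (the band choices here are illustrative)."""
    f = np.fft.rfftfreq(len(x), 1 / fs)
    p = np.abs(np.fft.rfft(x)) ** 2
    return [p[(f >= lo) & (f < hi)].sum() for lo, hi in bands]

def approx_entropy(x, m=2, r_factor=0.2):
    """Approximate entropy ApEn(m, r), a regularity measure that drops
    during rhythmic epileptic discharges. O(n^2) memory: use short
    windows."""
    r = r_factor * np.std(x)
    def phi(mm):
        emb = np.array([x[i:i + mm] for i in range(len(x) - mm + 1)])
        d = np.max(np.abs(emb[:, None] - emb[None, :]), axis=2)
        return np.mean(np.log(np.mean(d <= r, axis=1)))
    return phi(m) - phi(m + 1)

def features(x):
    # amplitude, band powers, approximate entropy, standard deviation
    return np.array([np.ptp(x), *band_powers(x),
                     approx_entropy(x), np.std(x)])
```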

Industrial Session Embedded AI Computing Platform

Organizers:

Jiun-In Guo, National Chiao Tung University, Taiwan

Rajiv Joshi, IBM Research TJ Watson, USA

Kyomin Sohn, Samsung Electronics, Korea

Abstract: This industrial session aims to discuss emerging technologies for embedded AI computing platforms from industry. We are entering an era of embedded AI, where real-time computing technology and platforms for AI deep learning will become increasingly important for a variety of real-time applications. This industrial session consists of two parts. The first part discusses AI computing platform technology, including NeuroPilot, a cross-platform framework for edge AI; large model support for deep learning in Caffe and Chainer; and a multi-task ADAS system on FPGA. The second part focuses on compiler technologies for AI chips, including Autopiler, an AI-based framework for program auto-tuning and options recommendation, and ONNC, a compilation framework connecting ONNX to proprietary deep learning accelerators.

IN01 Industrial Session 1 AI Computing Platform

Wednesday, March 20|09:30-10:30

Ballroom D, 11F

Chair(s):

Jiun-In Guo, National Chiao-Tung University, Taiwan

IN01.1 09:30-09:50 NeuroPilot: A Cross-Platform Framework for Edge-AI

Tung-Chien Chen*, Wei-Ting Wang, Kloze Kao, Chia-Lin Yu, Code Lin, Shu-Hsin Chang, Pei-Kuei Tsung

MediaTek Inc.

Artificial intelligence (AI) has been moving from cloud servers to edge devices because of its rapid response, privacy, robustness, and efficient use of network bandwidth. However, it is challenging to deploy computation- and memory-bandwidth-intensive AI on edge devices, where power and hardware resources are limited. The various needs of applications, diverse devices, and fragmented supporting tools make the integration a tough task. In this paper, NeuroPilot, a cross-platform framework for edge AI, is introduced. Technologies at the software, hardware, and integration levels are proposed to achieve high performance while preserving flexibility. The NeuroPilot solution provides superior edge-AI capability for a wide range of applications.

IN01.2 09:50-10:10 Large Model Support for Deep Learning in Caffe and Chainer

Minsik Cho, Tung D. Le, Ulrich A. Finkler, Haruki Imai, Yasushi Negishi, Taro Sekiyama, Saritha Vinod, Vladimir Zolotov, Kiyokuni Kawachiya, David S. Kung, and Hillery C. Hunter

IBM T. J. Watson Research Center; IBM Tokyo Research Lab

Deep learning is both compute- and data-intense, and recent breakthroughs have largely been fueled by the fp32 compute capacity of modern GPUs. This has made GPUs the prevalent tool for training deep neural networks, but GPUs have only small amounts of costly 3D-stacked HBM DRAM as their local memory. Working out of a small memory imposes a limit on the maximum learning capacity a neural network can have (i.e., the number of learnable parameters) and on the maximum size and number of samples a network can consume at a given time. The field of deep learning is evolving in many new directions, and research teams are exploring both very large neural networks and the application of deep learning to real datasets, including high-resolution images. Those exploring the boundaries of neural networks on real datasets today will generally find that their deep learning software won't support what they wish to train, and if it does, that performance is intolerably slow. In this paper, we present the idea of large model support and its implementation in two popular deep learning frameworks, Caffe and Chainer. The key idea is to use GPU memory as an application-level cache with respect to the host memory so that a large network (e.g., many parameters or many layers) can be trained with real-world samples (e.g., HD images). Although our large model support scheme may degrade training performance due to the communication overhead between the system CPUs and GPUs, this overhead is generally observed to decrease significantly with a faster communication link between the CPU and GPU (NVLink and next-gen NVLink). Our experimental results show that our large model support in Caffe and Chainer performs very well and can train 2 to 6 times larger ImageNet models.
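
The caching idea can be pictured with a small sketch; `move_to_gpu`/`move_to_cpu` below are hypothetical transfer hooks, not the actual Caffe or Chainer internals:

```python
class LayerSwapper:
    """Minimal sketch of treating GPU memory as an application-level
    cache over host memory: only the layer currently computing is
    resident on the GPU; every other layer's weights stay in host RAM.
    move_to_gpu/move_to_cpu are hypothetical transfer callbacks."""

    def __init__(self, layers, move_to_gpu, move_to_cpu):
        self.layers = layers
        self.move_to_gpu = move_to_gpu
        self.move_to_cpu = move_to_cpu

    def forward(self, x):
        for layer in self.layers:
            self.move_to_gpu(layer)   # fetch weights on demand
            x = layer(x)              # compute with resident weights
            self.move_to_cpu(layer)   # evict to the host backing store;
                                      # a fast CPU-GPU link (NVLink)
                                      # hides most of this overhead
        return x
```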

IN01.3 10:10-10:30 Multi-task ADAS system on FPGA

Jinzhang Peng1, Lu Tian*2,1, Xijie Jia1, Haotian Guo1, Yongsheng Xu1, Dongliang Xie1, Hong Luo1, Yi Shan1, Yu Wang2 1Xilinx, Inc. 2Department of Electronic Engineering, Tsinghua University

Advanced Driver-Assistance Systems (ADAS) can help drivers in the driving process and increase driving safety by automatically detecting objects, performing basic classification, implementing safeguards, etc. ADAS integrate multiple subsystems including object detection, scene segmentation, lane detection, and so on. Most algorithms are designed for one specific task, but such separate approaches are inefficient in an ADAS that consists of many modules. In this paper, we establish a multi-task learning framework for lane detection, semantic segmentation, 2D object detection, and orientation prediction on FPGA. The performance on FPGA is optimized by software and hardware co-design. The system deployed on a Xilinx ZU9 board achieves 55 FPS, which meets the real-time processing requirement.

SS04 Special Session 4 Intelligent processing of time-series signals

Wednesday, March 20|10:50-12:10

Ballroom B, 10F

Chair(s):

Guoxing Wang, Shanghai Jiao Tong University, China

SS04.1 10:50-11:10 Classification of Cardiac Arrhythmias Based on Artificial Neural Networks and Continuous-in-Time Discrete-in-Amplitude Signal Flow

Yang Zhao*, Simon Lin, Zhongxia Shang, Yong Lian

EECS, Lassonde School of Engineering, York University, Canada

Conventional Artificial Neural Networks (ANNs) for the classification of cardiac arrhythmias are based on Nyquist-sampled electrocardiogram (ECG) signals. The uniform sampling scheme introduces large redundancy in the ANN, which results in high power consumption and large silicon area. To address these issues, we propose to use a continuous-in-time discrete-in-amplitude (CTDA) sampling scheme as the input of the network. The CTDA sampling scheme significantly reduces the number of sample points on the baseline part of the signal while providing more detail on the useful features of the ECG signal. It is shown that the CTDA sampling scheme achieves significant savings in arithmetic operations in the ANN while maintaining classification performance similar to Nyquist sampling. The proposed method is evaluated on the MIT-BIH arrhythmia database following the AAMI recommended practice.
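
CTDA sampling is essentially level-crossing sampling; a minimal sketch, with an illustrative amplitude step `delta`, shows why a flat ECG baseline produces few samples while steep QRS complexes produce many:

```python
import numpy as np

def ctda_sample(sig, t, delta):
    """Continuous-in-time, discrete-in-amplitude sampling sketch: emit
    a sample only when the signal moves by one amplitude step `delta`
    (one level per input sample, a simplification). Returns a list of
    (timestamp, quantized level) pairs with non-uniform timing."""
    samples, last = [], sig[0]
    for ti, x in zip(t, sig):
        if abs(x - last) >= delta:
            last += delta * np.sign(x - last)
            samples.append((ti, last))
    return samples
```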

SS04.2 11:10-11:30 Improved Convolutional Neural Network Based Detector Model for Small Visual Object Detection in Autonomous Driving

Shijin Song*1, Yongxin Zhu1,2, Junjie Hou1, Yu Zheng1, Tian Huang3, Sen Du1 1School of Microelectronics, Shanghai Jiao Tong University, China 2Shanghai Advanced Research Institute, Chinese Academy of Sciences, China 3University of Cambridge, United Kingdom

As the killer application of artificial intelligence, autonomous driving is making fundamental transformations to the transportation industry, with computer vision based on deep learning among its enabling technologies. However, small objects around vehicles are difficult to detect because of the poor visual features within small objects as well as the insufficient number of valid samples of small objects. In this paper, we propose an end-to-end detector model based on a convolutional neural network (CNN) to enhance the visual features of small traffic signs in real scenarios. With those enhanced features, we obtain an efficient inference model after training. We further make a preliminary comparison with the Fast R-CNN and Faster R-CNN models. Experimental results indicate that our model outperforms the others by more than 10% in terms of accuracy and recall.

SS04.3 11:30-11:50 Accelerating CNN-RNN Based Machine Health Monitoring on FPGA

Xiaoyu Feng*, Jinshan Yue, Qingwei Guo, Huazhong Yang, Yongpan Liu

Tsinghua University, China

Emerging artificial intelligence brings new opportunities for embedded machine health monitoring systems. However, previous work mainly focuses on algorithm improvement and ignores software-hardware co-design. This paper proposes a CNN-RNN algorithm for remaining useful life (RUL) prediction, with hardware optimization for practical deployment. The CNN-RNN algorithm combines the feature extraction ability of CNNs with the sequential processing ability of RNNs, showing a 23%-53% improvement on the CMAPSS dataset. The algorithm also considers hardware implementation overhead, and an FPGA-based accelerator is developed. The accelerator adopts a kernel-optimized design to exploit data reuse and reduce memory accesses. It enables real-time response and 5.89 GOPS/W energy efficiency with small size and cost overhead. The FPGA implementation shows a 15x CNN speedup and a 9x overall speedup compared with an embedded Cortex-A9 processor.

SS04.4 11:50-12:10 Heart Rate Estimation from Ballistocardiogram Using Hilbert Transform and Viterbi Decoding

Qingsong Xie, Yongfu Li, Guoxing Wang*, Yong Lian

Shanghai Jiao Tong University, China

This paper presents a robust algorithm to estimate heart rate (HR) from the ballistocardiogram (BCG). The BCG signal can be easily acquired from a vibration or force sensor embedded in a chair or a mattress, without any electrode attached to the body. The algorithm employs the Hilbert Transform to reveal the frequency content of the J-peak in the BCG signal. Viterbi decoding (VD) is used to estimate HR by finding the most likely path through the time-frequency state-space plane. The performance of the proposed algorithm is evaluated on BCG recordings from 10 subjects. A mean absolute error (MAE) of 1.35 beats per minute (BPM) and a standard deviation of absolute error (STD) of 1.99 BPM are obtained. A Pearson correlation coefficient of 0.94 between estimated HR and true HR is also achieved.
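
A rough end-to-end sketch of this pipeline, envelope via the Hilbert transform, short-time spectra, then a Viterbi pass that penalizes implausible frequency jumps, could read as follows; the window length, hop, band limits, and jump penalty are illustrative choices, not the paper's tuned values:

```python
import numpy as np
from scipy.signal import hilbert

def hr_track(bcg, fs, win=10, step=2, f_lo=0.7, f_hi=3.0, penalty=0.5):
    """Estimate an HR trajectory from BCG (len(bcg) >= win * fs).
    The analytic signal's envelope emphasizes the J-peak rhythm; a
    Viterbi pass picks the most likely frequency path over time."""
    env = np.abs(hilbert(bcg))
    n, hop = int(win * fs), int(step * fs)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    obs = []                                # log-magnitude per frame
    for s in range(0, len(env) - n + 1, hop):
        seg = env[s:s + n] - env[s:s + n].mean()
        obs.append(np.log(np.abs(np.fft.rfft(seg))[band] + 1e-9))
    obs, fb = np.array(obs), freqs[band]
    # Viterbi over frequency bins: evidence minus a jump penalty
    score, back = obs[0].copy(), []
    for o in obs[1:]:
        trans = score[None, :] - penalty * np.abs(fb[:, None] - fb[None, :])
        back.append(trans.argmax(1))
        score = o + trans.max(1)
    path = [int(score.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return fb[np.array(path[::-1])] * 60.0  # beats per minute
```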

L7 Lecture Session 7 Medical AI (II)

Wednesday, March 20|10:50-12:10

Ballroom C, 10F

Chair(s):

Mohamad Sawan, Westlake University, China

L7.1 10:50-11:10 Automatic HCC Detection Using Convolutional Network with Multi-Magnification Input Images

Wei-Che Huang1, Pau-Choo Chung1, Hung-Wen Tsai2, Nan-Haw Chow3, Ying-Zong Juang4, Cheng-Hsiung Wang*4, Hann-Huei Tsai4, Shih-Hsuan Lin1 1National Cheng Kung University, Taiwan 2Department of Pathology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Taiwan 3College of Medicine, National Cheng Kung University, Taiwan 4Taiwan Semiconductor Research Institute, National Applied Research Laboratories, Taiwan

Postoperative pathologic examination of stained liver tissue is an important step in identifying prognostic factors for follow-up care. Traditionally, liver cancer detection is performed by pathologists observing the entire biological tissue, resulting in heavy workloads and potential misjudgment. Accordingly, automatic pathological examination has been studied for a long time. Most existing cancer detection approaches, however, only extract cell-level information based on single-scale high-magnification patches. In liver tissue, common cell-change phenomena such as apoptosis, necrosis, and steatosis appear similar in tumorous and benign regions. Hence, detection may fail when a patch covers only a changed-cell area that cannot provide enough neighboring cell-structure information. To overcome this problem, a convolutional network architecture with multi-magnification input can provide not only cell-level information from high-magnification patches but also cell-structure information from low-magnification patches. The detection algorithm consists of two main structures: 1) extraction of cell-level and cell-structure-level feature maps from high-magnification and low-magnification images, respectively, by separate general convolutional networks, and 2) integration of the multi-magnification features by a fully connected network. In this paper, VGG16 and Inception V4 were applied as the base convolutional networks for the liver tumor detection task. The experimental results showed that the VGG16-based multi-magnification-input convolutional network achieved 91% mIOU on the HCC tumor detection task. In addition, in a comparison between single-scale CNN (SSCN) and multi-scale CNN (MSCN) approaches, the MSCN demonstrated that multi-scale patches provide better performance on the HCC classification task.

L7.2 11:10-11:30 Using a Cropping Technique or Not: Impacts on SVM-based AMD Detection on OCT Images

Cheng-En Ko*1, Po-Han Chen1, Wei-Ming Liao1, Cheng-Kai Lu2, Cheng-Hung Lin1, Jing-Wen Liang1 1Yuan Ze University, Taiwan 2Universiti Teknologi PETRONAS, Malaysia

This paper compares the system performance of flows with and without automatic image cropping for age-related macular degeneration (AMD) detection on optical coherence tomography (OCT) images. With image cropping, the computational time of noise removal and feature extraction can be significantly reduced at a small loss of detection accuracy. The simulation results show that using image cropping as the first stage achieves 93.4% accuracy. Compared to the flow without image cropping, the cropping flow loses only 0.5% accuracy but saves about 12 hours of computation time and about half of the memory storage.

L7.3 11:30-11:50 AI-Based Edge-Intelligent Hypoglycemia Prediction System Using Alternate Learning and Inference Method for Blood Glucose Level Data with Low-Periodicity

Tran Minh Quan1, Takuyoshi Doike1, Dang Cong Bui1, Kenya Hayashi1, Shigeki Arata1, Atsuki Kobayashi1, Md. Zahidul Islam1, Kiichi Niitsu*1,2 1Nagoya University, Japan 2PRESTO, JST, Japan

In this study, we developed an AI-based edge-intelligent hypoglycemia prediction system for environments where blood glucose (BG) level measurements have low periodicity. By using long short-term memory (LSTM), a neural network specialized for handling time-series data, along with alternate learning and inference, the BG level can be predicted with high accuracy. To this end, a system for predicting BG level was created using LSTM, and its performance was evaluated as a classification problem. The system successfully predicted the occurrence of hypoglycemia 30 minutes ahead approximately 80% of the time. Furthermore, it was demonstrated that accuracy is improved by alternately performing learning and prediction.

L7.4 11:50-12:10 A Deep Learning Based Wearable Medicines Recognition System for Visually Impaired People

Wan-Jung Chang1,2, Yue-Xun Yu1, Jhen-Hao Chen1, Zhi-Yao Zhang1, Sung-Jie Ko1, Tsung-Han Yang1, Chia-Hao Hsu1,2, Liang-Bi Chen*1,2, Ming-Che Chen2,1 1Southern Taiwan University of Science and Technology, Taiwan 2Artificial Intelligence over Internet of Things Applied Research Center (AIoT Center), Southern Taiwan University of Science and Technology, Taiwan

This paper proposes a deep learning based wearable medicines recognition system for visually impaired people. The proposed system is composed of a pair of wearable smart glasses, a wearable waist-mounted drug-pill recognition device, a mobile device application, and a cloud-based management platform. The proposed system uses deep learning technology to identify drug pills so that users avoid taking the wrong drugs. The experimental results show that the accuracy of the proposed system reaches up to 90%, achieving the purpose of correct medication for visually impaired people.

IN02 Industrial Session 2 Compiler Technology for AI Chip

Wednesday, March 20|10:50-12:10

Ballroom D, 11F

Chair(s):

Jiun-In Guo, National Chiao-Tung University, Taiwan

IN02.1 10:50-11:10 Autopiler: An AI Based Framework for Program Autotuning and Options Recommendation

Kang-Lin Wang, Chi-Bang Kuan, Jiann-Fuh Liaw, Wei-Liang Kuo

MediaTek Inc.

Program autotuning has been proven to achieve great performance improvements in various domains. Many autotuning frameworks support fully customizable configuration representations, a wide variety of representations for domain-specific tuning, and a user-friendly interface for interaction between the program and the autotuner. However, tuning programs takes time, whether autotuned or manually tuned. Often, programmers don't have time to wait for autotuners to finish and want reasonably good options to use instantly. This paper introduces Autopiler, a framework for building non-domain-specific multi-objective program autotuners with machine-learning-based recommender systems for options prediction. The framework not only supports non-domain-specific tuning techniques but also learns from previous tuning results and can recommend adequately good options before any tuning happens. We illustrate the architecture of Autopiler and how to leverage a recommender system for compiler options recommendation, such that Autopiler learns from programs and becomes an AI-boosted smart compiler. The experimental results show that Autopiler can deliver up to 19.46% performance improvement for in-house 4G LTE modem workloads.

IN02.2 11:10-11:30 ONNC: A Compilation Framework Connecting ONNX to Proprietary Deep Learning Accelerators

Wei-Fen Lin*, Der-Yu Tsai, Luba Tang, Cheng-Tao Hsieh, Cheng-Yi Chou, Ping-Hao Chang, Luis Hsu

Skymizer Taiwan Inc.

This paper presents ONNC (Open Neural Network Compiler), a retargetable compilation framework designed to connect ONNX (Open Neural Network Exchange) models to proprietary deep learning accelerators (DLAs). The intermediate representations (IRs) of ONNC have a one-to-one mapping to ONNX IRs, making porting ONNC to proprietary DLAs much simpler than with other compilation frameworks such as TVM and Glow, especially for hardware with coarse-grained operators that are not part of the generic IRs in the LLVM backend. ONNC also has a flexible pass manager designed to support compiler optimizations at all levels. A Docker image of ONNC bundled with a Vanilla backend is released with this paper to enable fast porting to new hardware targets. To illustrate how an ONNC-based toolkit guides our research and development in DLA design, we present a case study on compiler optimizations for activation memory consumption. The study shows that the Best-Fit algorithm with a proposed heuristic and a reordering scheme can act as a near-optimal strategy, bringing memory consumption close to the ideal lower bound in 11 of 12 models from the ONNX model zoo. To the best of our knowledge, ONNC is the first open-source compilation framework specially designed to support ONNX-based models for both commercial and research deep learning projects.
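
To make the case study concrete, here is a minimal sketch of best-fit placement of activation tensors given their liveness intervals; the tensor ordering (the paper's heuristic and reordering scheme) is left to the caller, and the data layout is an assumption:

```python
def best_fit_allocate(tensors):
    """Best-fit activation-memory allocation sketch. `tensors` is a
    list of (size, live_start, live_end) tuples, pre-ordered by the
    caller. Returns per-tensor offsets and the peak memory, which is
    the quality metric the compiler tries to minimize."""
    placed, offsets = [], []             # placed: (offset, size, s, e)
    for size, s, e in tensors:
        # memory intervals occupied while this tensor is live
        live = sorted((o, o + sz) for o, sz, s2, e2 in placed
                      if s < e2 and s2 < e)
        # enumerate gaps between live intervals, plus the open top gap
        gaps, prev = [], 0
        for lo, hi in live:
            if lo - prev >= size:
                gaps.append((lo - prev, prev))   # (gap size, offset)
            prev = max(prev, hi)
        gaps.append((float("inf"), prev))        # growing the top fits
        _, off = min(gaps)                       # best fit: tightest gap
        placed.append((off, size, s, e))
        offsets.append(off)
    peak = max((o + sz for o, sz, _, _ in placed), default=0)
    return offsets, peak
```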

P1 Poster Session 1 Applications of Deep Neural Network

Wednesday, March 20|13:10-14:40

Ballroom A, 10F

Chair(s):

Ching-Hwa Cheng, Feng Chia University, Taiwan

P1.1 Flyintel – a Platform for Robot Navigation based on a Brain-Inspired Spiking Neural Network

Huang-Yu Yao*, Hsuan-Pei Huang, Yu-Chi Huang, Chung-Chuan Lo

National Tsing Hua University, Taiwan

Spiking neural networks (SNN) are regarded by many as the "third generation network" that will solve computation problems in a more biologically realistic way. In our project, we design a robotic platform controlled by a user-defined SNN in order to develop a next-generation artificial intelligence robot with high flexibility. This paper describes the preliminary progress of the project. We first implement a basic decision network, with which the robot is able to perform a basic but vital foraging and risk-avoiding task. Next, we implement the neural network of the fruit fly central complex in order to endow the robot with spatial orientation memory, a crucial function underlying spatial navigation.

P1.2 A Learnable Unmanned Smart Logistics Prototype System Design and Implementation

I-Lok Cheng1, Ching-Hwa Cheng*2, Don-Gey Liu2 1GMT Global Inc. 2Department of Electronics, Feng Chia University, Taiwan

Most of today's logistics systems require people to control them. If there is not enough manpower, e.g., drivers, or the destination is unfamiliar to the driver, delivery could be delayed or goods may be sent to the wrong location. This paper demonstrates a prototype of a learnable smart system for the precise positioning of unmanned transport machines. The proposed system consists of robotic arms, land vehicles, and unmanned aerial vehicles, which can easily deliver light cargo to a designated place. The proposed design can automatically deliver goods to designated locations while avoiding environmental influences. The interactive use of unmanned ground vehicles and unmanned aerial vehicles makes it possible to transport goods to a precise destination. The prototype can be demonstrated to evaluate the feasibility and performance of a learnable unmanned intelligent transportation system.

P1.3 On Automatic Generation of Training Images for Machine Learning in Automotive Applications

Tong-Yu Hsieh*, Yuan-Cheng Lin, Hsin-Yung Shen

National Sun Yat-sen University, Taiwan

Machine learning is expected to play an important role in implementing automotive systems such as Advanced Driver Assistance Systems (ADAS). To make machine learning methods work well, providing a sufficient number of training samples is very important. However, collecting the training data may be difficult or very time-consuming. In this paper, we investigate the automatic generation of training data for automotive applications. Generative Adversarial Network (GAN) techniques are employed to generate fake yet high-quality data for machine learning. Although using GANs to generate training images has been proposed in the literature, previous work does not consider automotive applications. In this work, a case study on vehicle detection demonstrates the power of GANs and the effectiveness of GAN-generated training images. The generated fake bus images are employed as training data, and an SVM (Support Vector Machine) method is implemented to detect buses. The results show that the SVM trained on the fake images achieves almost the same detection accuracy as one trained on real images. The results also show that the GAN can generate training images very quickly. The extension of GANs to generate road images under various weather conditions, such as fog or night, is also discussed.

P1.4 Online Anomaly Detection in HPC Systems

Andrea Borghesi*1, Antonio Libri2, Luca Benini2, Andrea Bartolini1 1University of Bologna, Italy 2IIS, ETHZ, Zurich, Switzerland

Reliability is a thorny problem in the evolution of High-Performance Computing systems. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. Clearly, this approach does not scale to future-generation Exascale supercomputers: automated methods to detect faults and unhealthy conditions are needed. In this paper, we propose to solve this issue by combining artificial intelligence, big data, and edge computing approaches for automated detection in HPC systems. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production supercomputer (thereby identifying unhealthy conditions), and deployed at the edge on each computing node. We obtain very good accuracy (values ranging between 90% and 95%) and also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the performance of the computing units.
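
The deployment logic follows the usual reconstruction-error recipe; a minimal sketch, where `autoencoder` is any fitted model exposing a hypothetical `predict` method and the percentile threshold is an illustrative choice:

```python
import numpy as np

def detect_anomalies(autoencoder, healthy, incoming, q=99.0):
    """The autoencoder is trained only on normal node telemetry, so a
    large reconstruction error flags an unhealthy state. The threshold
    is set at the q-th percentile of errors on known-healthy data."""
    err = lambda X: np.mean((X - autoencoder.predict(X)) ** 2, axis=1)
    threshold = np.percentile(err(healthy), q)
    return err(incoming) > threshold        # True = anomalous sample
```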

P2 Poster Session 2 Algorithms and Architectures for Neural Networks

Wednesday, March 20|13:10-14:40

Ballroom A, 10F

Chair(s):

Wai-Chi Fang, National Chiao Tung University, Taiwan

P2.1 Configurable Texture Unit for Convolutional Neural Networks on Graphics Processing Units

Yi-Hsiang Chen*, Shao-Yi Chien

National Taiwan University, Taiwan

To accelerate Convolutional Neural Network (CNN) operations on resource-limited mobile graphics processing units (GPUs), we exploit the common characteristics of texture filtering and convolutional layers and propose a configurable texture unit, called the tensor and texture unit (TTU), to offload computation from the shader cores. By adding a new datapath for loading weight parameters into the texture unit, reusing the original texture cache, increasing the flexibility of the filtering unit, and packing the input data and weight parameters into a fixed-point format, we enable the texture unit to support convolutional and pooling layers with only small modifications. The proposed architecture is verified by integrating the TTU into a GPU system at the RTL level. Experimental results show that an 18.54x speedup can be achieved with an overhead of only 8.5% compared with a GPU system with a traditional texture unit.

P2.2 Implementation of STDP Learning for Non-volatile Memory-based Spiking Neural Network using Comparator Metastability

Sang-Gyun Gi*, Injune Yeo, Byung-geun Lee

Gwangju Institute of Science and Technology, South Korea

This paper presents a circuit for spike-timing-dependent plasticity (STDP) learning in a non-volatile memory (NVM) based spiking neural network (SNN). Unlike conventional hardware implementations of STDP learning, the proposed circuit does not require additional memory, amplifiers, or an STDP spike generator. Instead, the circuit utilizes the metastability-time information of a dynamic comparator to implement the non-linear transfer curve of STDP learning. The circuit includes a dynamic comparator, an NVM device, and some digital circuitry to write the conductance of the NVM according to the STDP learning rule. Finally, the conductance response model and the designed circuit for STDP learning are used to compare the simulated STDP with mathematical STDP. Applications of the proposed circuit lie in the design of NVM-based SNN hardware and other bio-inspired hardware systems.

P2.3 Heterogeneous activation function extraction for training and optimization of SNN systems

Amir Zjajo*, Sumeet Kumar, Rene van Leuken

Delft University of Technology, The Netherlands

The energy-efficiency and computation characteristics of analog/mixed-signal spiking neural networks offer a capable platform for implementing cognitive tasks on resource-limited embedded platforms. However, inherent mismatch in analog devices severely influences the accuracy and reliability of the computing system. In this paper, we devise an efficient algorithm for extracting the heterogeneous activation functions of analog hardware neurons as a set of constraints in an offline training and optimization process, and examine how compensation of the mismatch effects influences the synchronicity and information processing capabilities of the system.

P2.4 Performance Trade-offs in Weight Quantization for Memory-Efficient Inference

Pablo M. Tostado, Bruno U. Pedroni, Gert Cauwenberghs

University of California San Diego, USA

Over the past decade, Deep Neural Networks (DNNs) trained using Deep Learning (DL) frameworks have become the workhorse for solving a wide variety of computational tasks in big data environments. To date, DL DNNs have relied on large amounts of computational power to reach peak performance, typically relying on the high computational bandwidth of GPUs while straining available memory bandwidth and capacity. With ever-increasing data complexity and more stringent energy constraints in Internet-of-Things (IoT) application environments, there has been growing interest in the development of more efficient DNN inference methods that economize on random-access memory usage in weight access. Herein, we present a systematic analysis of the performance trade-offs of quantized weight representations at variable bit lengths for memory-efficient inference in pre-trained DNN models. We vary the mantissa and exponent bit lengths in the representation of the network parameters and examine the effect of DropOut regularization during pre-training as well as the impact of two different weight truncation mechanisms: stochastic and deterministic rounding. We show a drastic reduction in memory needs, down to 4 bits per weight, while maintaining near-optimal test performance of low-complexity DNNs pre-trained on the MNIST and CIFAR-10 datasets. These results offer a simple methodology for achieving high memory and computation efficiency of inference in DNN-dedicated low-power hardware for IoT, directly from pre-trained, high-resolution DNNs using standard DL algorithms.
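
The mantissa/exponent trade-off and the two rounding modes can be mimicked offline with a short sketch (not a bit-exact IEEE format; the bit widths below are illustrative):

```python
import numpy as np

def quantize(w, mant_bits, exp_bits, stochastic=False, rng=None):
    """Quantize a weight array to a low-precision float-like format
    with `mant_bits` mantissa and `exp_bits` exponent bits, using
    stochastic or deterministic rounding of the mantissa."""
    sign = np.sign(w)
    mag = np.abs(w) + 1e-45
    exp = np.clip(np.floor(np.log2(mag)),
                  -2 ** (exp_bits - 1), 2 ** (exp_bits - 1) - 1)
    scaled = mag / 2 ** exp * 2 ** mant_bits   # mantissa as an integer
    if stochastic:
        rng = rng or np.random.default_rng()
        # round up with probability equal to the fractional part
        scaled = np.floor(scaled + rng.random(w.shape))
    else:
        scaled = np.round(scaled)              # deterministic rounding
    return sign * scaled / 2 ** mant_bits * 2 ** exp
```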

P2.5 Elastic Neural Networks for Classification

Yi Zhou*1, Yue Bai1, Shuvra S. Bhattacharyya1,2, Heikki Huttunen1 1Tampere University of Technology, Finland 2University of Maryland, USA

In this work we propose a framework for improving the performance of any deep neural network that may suffer from vanishing gradients. To address the vanishing-gradient issue, we study a framework in which we insert an intermediate output branch after each layer in the computational graph and use the corresponding prediction loss to feed the gradient to the early layers. The framework, which we name Elastic network, is tested with several well-known networks on the CIFAR10 and CIFAR100 datasets, and the experimental results show that the proposed framework improves accuracy on both shallow networks (e.g., MobileNet) and deep convolutional neural networks (e.g., DenseNet). We also identify the types of networks where the framework does not improve performance and discuss the reasons. Finally, as a side product, the computational complexity of the resulting networks can be adjusted in an elastic manner by selecting the output branch according to the current computational budget.
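
A minimal PyTorch-style sketch of the branch-per-stage idea is given below; the backbone, channel sizes, and branch placement are illustrative, not the paper's exact configurations:

```python
import torch.nn as nn

class ElasticNet(nn.Module):
    """An auxiliary classifier after each backbone stage feeds
    gradients directly into early layers; at inference, any branch can
    serve as the output to fit the current computational budget."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                          nn.ReLU(), nn.MaxPool2d(2))
            for c_in, c_out in [(3, 32), (32, 64), (64, 128)]])
        self.heads = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(c, num_classes))
            for c in (32, 64, 128)])

    def forward(self, x):
        outs = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            outs.append(head(x))   # one prediction per depth
        return outs                # train on a weighted sum of the
                                   # per-branch cross-entropy losses
```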

P2.6 Optimizations of Scatter Network for Sparse CNN Accelerators

Sunwoo Kim1, Chungman Lee1, Haesung Park1, Jooho Wang1, Sungkyung Park2, Chester Sungchung Park*1 1Konkuk University, Korea 2Pusan National University, Korea

Sparse CNN (SCNN) accelerators tend to suffer from bus contention in their scatter networks. This paper considers optimizations of the scatter network. Several network topologies and arbitration algorithms are evaluated in terms of performance and area.

P2.7 Fast Convolution Algorithm for Convolutional Neural Networks

Tae Sun Kim, JiHoon Bae, Myung Hoon Sunwoo*

Ajou University, Korea

Recent advances in computing power, made possible by the development of faster general-purpose graphics processing units (GPGPUs), have increased the complexity of convolutional neural network (CNN) models. However, because of the limited applicability of existing GPGPUs, CNN accelerators are becoming more important. Current accelerators focus on improvements in memory scheduling and architectures, so the number of multiplier-accumulator (MAC) operations is not reduced. In this study, a new convolution-layer operation algorithm is proposed using a coarse-to-fine method instead of hardware or architecture approaches. The algorithm is shown to reduce MAC operations by 33%, while Top-1 accuracy decreases by only 3% and Top-5 accuracy by only 1%.

SS05 Special Session 5 Emerging Memory Technologies for Neuromorphic Circuits and Systems

Wednesday, March 20|14:40-16:00

Ballroom B, 10F

Chair(s):

Jason Eshraghian, University of Western Australia, Australia

Alex James, Nazarbayev University, Kazakhstan

SS05.1 14:40-15:00 AnalogHTM: Memristive Spatial Pooler Learning with Backpropagation

Olga Krestinskaya, Alex Pappachen James*

Nazarbayev University, Kazakhstan

The spatial pooler is responsible for feature extraction in Hierarchical Temporal Memory (HTM). In this paper, we present analog backpropagation learning circuits integrated into the memristive circuit design of the spatial pooler. Using 0.18 um CMOS technology and TiOx memristor models, the maximum on-chip area and power consumption of the proposed design are 8335.074 um^2 and 51.55 mW, respectively. The system is tested on a face recognition problem with the AR face database, achieving a recognition accuracy of 90%.

SS05.2 15:00-15:20 Analog Weights in ReRAM DNN Accelerators

Jason Eshraghian*1, Sung-Mo Kang2, Seungbum Baek3, Garrick Orchard4, Herbert Ho-Ching Iu1, Wen Lei1 1University of Western Australia, Australia 2University of California, USA 3Chungbuk National University, Korea 4National University of Singapore, Singapore

Artificial neural networks have become ubiquitous in modern life, which has triggered the emergence of a new class of application-specific integrated circuits for their acceleration. ReRAM-based accelerators have gained significant traction due to their ability to leverage in-memory computation. In a crossbar structure, they can perform multiply-and-accumulate operations more efficiently than standard CMOS logic. By virtue of being resistive switches, ReRAM devices can only reliably store one of two states, a severe limitation on the range of values in a computational kernel. This paper presents a novel scheme for alleviating the single-bit-per-device restriction by exploiting the frequency dependence of v-i plane hysteresis, assigning kernel information not only to the device conductance but also partially distributing it into the frequency of a time-varying input. We show this approach reduces average power consumption for a single crossbar convolution by up to a factor of 16 for an unsigned 8-bit input image, where each convolutional process consumes a worst-case of 1.1 mW, and reduces area by a factor of 8, without reducing accuracy to the level of binarized neural networks. This presents a massive saving in computing cost when many simultaneous in-situ multiply-and-accumulate processes occur across different crossbars.

SS05.3 15:20-15:40 AMSNet: Analog Memristive System Architecture for Mean-Pooling with Dropout Convolutional Neural Network

Olga Krestinskaya, Adilya Bakambekova, Alex Pappachen James*

Nazarbayev University, Kazakhstan

This work proposes an analog hardware implementation of a Mean-Pooling Convolutional Neural Network (CNN) with 50% random-dropout backpropagation training. We illustrate the effect of the variability of real memristive devices on the performance of the CNN, and its tolerance to input noise. The classification accuracy of the CNN is approximately 93%, independent of memristor variability and input noise. The on-chip area and power consumption of the analog 180 nm CMOS CNN with WOx memristors are 0.09338995 mm^2 and 3.3992 W, respectively.

SS05.4 15:40-16:00 Binarized Neural Network with Stochastic Memristors

Olga Krestinskaya, Otaniyoz Otaniyozov, Alex Pappachen James*

Nazarbayev University, Kazakhstan

This paper proposes an analog hardware implementation of a Binarized Neural Network (BNN). Most existing hardware implementations of neural networks do not consider the memristor variability issue and its effect on overall system performance. In this work, we investigate the variability of memristive devices in crossbar dot-product computation and the leakage currents in the proposed BNN, and show how they affect overall system performance.

L8 Lecture Session 8 Low Precision Neural Network

Wednesday, March 20|14:40-16:00

Ballroom C, 10F

Chair(s):

Shao-Yi Chien, National Taiwan University, Taiwan

L8.1 14:40-15:00 Exploration of Automatic Mixed-Precision Search for Deep Neural Networks

Xuyang Guo1, Yuanjun Huang*2, Hsin-Pai Cheng3, Bing Li3,5, Wei Wen3, Siyuan Ma4, Hai Li3, Yiran Chen3 1Tsinghua University, China 2University of Science and Technology of China, China 3Duke University, USA 4Xi'an Jiaotong University, China 5Army Research Office, Research Triangle Park, USA

Neural networks have shown great performance in cognitive tasks. When deploying network models on mobile devices with limited computation and storage resources, the weight quantization technique has been widely adopted. In practice, 8-bit or 16-bit quantization is most likely to be selected in order to maintain accuracy at the same level as models in 32-bit floating-point precision. Binary quantization, on the contrary, aims for the highest compression at the cost of a much bigger accuracy drop. Applying different precisions in different layers/structures can potentially produce the most efficient model; seeking the best precision configuration, however, is difficult. In this work, we propose an automatic search algorithm to address this challenge. By relaxing the search space of the quantization bitwidth from the discrete to the continuous domain, our algorithm can generate a mixed-precision quantization scheme that achieves a compression rate close to that of a binary-weighted model while maintaining testing accuracy similar to the original full-precision model.

L8.2 15:00-15:20 Extended Bit-Plane Compression for Convolutional Neural Network Accelerators

Lukas Cavigelli*, Luca Benini

ETH Zurich, Switzerland

After the tremendous success of convolutional neural networks in image classification, object detection, speech recognition, etc., there is now rising demand for deploying these compute-intensive ML models on tightly power-constrained embedded and mobile systems at low cost, as well as for pushing throughput in data centers. This has triggered a wave of research into specialized hardware accelerators, whose performance is often constrained by I/O bandwidth and whose energy consumption is dominated by I/O transfers to off-chip memory. We introduce and evaluate a novel, hardware-friendly compression scheme for the feature maps present within convolutional neural networks. We show that an average compression ratio of 4.4x relative to uncompressed data and a gain of 60% over the existing method can be achieved for ResNet-34 with a compression block requiring <300 bits of sequential cells and minimal combinational logic.

L8.3 15:20-15:40 Multi-level Weight Indexing Scheme for Memory-Reduced Convolutional Neural Network

Jongmin Park, Seungsik Moon, Younghoon Byun, Sunggu Lee, Youngjoo Lee*

Pohang University of Science and Technology (POSTECH), Korea

Targeting resource-limited intelligent mobile systems, in this paper we present a multi-level weight indexing method that relaxes the memory requirements for realizing convolutional neural networks (CNNs). In contrast to previous works, which focus only on the positions of unpruned weights, the proposed work considers consecutive pruned positions to generate group-level validations. By storing the surviving indices only for the valid groups, the proposed multi-level indexing scheme reduces the amount of indexing data. In addition, we introduce indexing-aware multi-level pruning and indexing methods with variable group sizes, which can further optimize the memory overhead. For the same pruning factor, the memory size for storing the indexing information is reduced by up to 81%, leading to a practical CNN architecture for intelligent mobile devices.
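
The storage saving can be seen in a small sketch of two-level indexing over a pruning mask; the group size of 8 and the mask layout are assumptions for illustration:

```python
import numpy as np

def multilevel_index(mask, group=8):
    """Two-level index for a pruned weight mask (length divisible by
    `group`): one validity bit per group, plus within-group indices
    stored only for valid groups, so long pruned runs cost one bit
    each instead of a run of indices."""
    mask = mask.reshape(-1, group)            # (n_groups, group)
    group_valid = mask.any(axis=1)            # level 1: 1 bit per group
    fine = [np.flatnonzero(g) for g in mask[group_valid]]  # level 2
    bits = len(group_valid) + sum(len(f) for f in fine) * int(np.log2(group))
    return group_valid, fine, bits            # bits ~ index storage
```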

L8.4 15:40-16:00 Outstanding Bit Error Tolerance of Resistive RAM-Based Binarized Neural Networks

Tifenn Hirtzlin1, Marc Bocquet2, Jacques-Olivier Klein1, Etienne Nowak3, Elisa Vianello3, Jean Michel Portal2, Damien Querlioz*1 1Univ Paris-Sud, France 2Univ Aix-Marseille, France 3CEA, LETI, France

Resistive random access memories (RRAM) are novel nonvolatile memory technologies that can be embedded at the core of CMOS and could be ideal for the in-memory implementation of deep neural networks. A particularly exciting vision is using them to implement Binarized Neural Networks (BNNs), a class of deep neural networks with a highly reduced memory footprint. The challenge of resistive memories, however, is that they are prone to device variation, which can lead to bit errors. In this work we show, through simulations of networks on the MNIST and CIFAR10 tasks, that BNNs can tolerate these bit errors to an outstanding level. If a standard BNN is used, a bit error rate of up to 10^-4 can be tolerated with little impact on recognition performance on both MNIST and CIFAR10. We then show that, by adapting the training procedure to the fact that the BNN will be operated on error-prone hardware, this tolerance can be extended to a bit error rate of 0.04. The requirements on RRAM are therefore much less stringent for BNNs than for more traditional applications. We show, based on experimental measurements of an RRAM HfO2 technology, that this result can allow reducing RRAM programming energy by a factor of 30.
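
The error-aware training step can be approximated by flipping stored signs during the forward pass; a minimal sketch (the BER value and the sign binarization come from the abstract, the rest is illustrative):

```python
import numpy as np

def binarize_with_errors(w, ber, rng):
    """Binarize weights to +/-1 and flip each stored bit with
    probability `ber`, mimicking RRAM device errors during the forward
    pass so the network learns to tolerate them (e.g., train with
    ber=0.04 per the paper's finding)."""
    b = np.where(w >= 0, 1.0, -1.0)        # sign binarization
    flips = rng.random(w.shape) < ber      # per-device bit errors
    return np.where(flips, -b, b)
```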

SS06 Special Session 6 AI in Advanced Applications

Wednesday, March 20|16:20-17:40

Ballroom B, 10F

Chair(s):

Yeong-Kang Lai, National Chung Hsing University, Taiwan

Tsung-Jung Liu, National Chung Hsing University, Taiwan

SS06.1 16:20-16:40 Modern Architecture Style Transfer for Ruin Buildings

Chia-Ching Wang1, Hsin-Hua Liu2, Soo-Chang Pei2, Kuan-Hsien Liu3, Tsung-Jung Liu*1 1National Chung Hsing University, Taiwan 2National Taiwan University, Taiwan 3National Taichung University of Science and Technology, Taiwan

In this work, we focus on building style transfer, transforming ruin buildings into modern architecture. Inspired by Gatys's style transfer and Goodfellow's generative adversarial network (GAN), we use CycleGAN to tackle this type of problem. To avoid artifacts and generate better images, we add a "perception loss" to the network, a feature loss extracted with a pre-trained VGG model. We also adjust the cycle loss by changing the ratio of the weighting parameters. Finally, we collect images of both ruin and modern architecture from websites and use unsupervised learning to train the model. The experimental results show that our proposed method indeed realizes modern-architecture style transfer for ruin buildings.

SS06.2 16:40-17:00 Age Estimation on Low Quality Face Images

Kuan-Hsien Liu*1, Hsin-Hua Liu2, Soo-Chang Pei2, Tsung-Jung Liu3, Chun-Te Chang1 1National Taichung University of Science and Technology, Taiwan 2National Taiwan University, Taiwan 3National Chung Hsing University, Taiwan

In this paper, we contribute an age estimation method for dealing with low-quality face images. This is a practical and important problem because a received image may have low resolution or be corrupted by noise during transmission. Upon reviewing the literature on facial age estimation, we notice that few articles tackle this low-quality-image-based facial age estimation problem. In our framework, we propose a newly designed deep convolutional neural network architecture consisting of five major steps. Firstly, we use a super-resolution method to enhance the input images. Secondly, a data augmentation step is utilized to ease the training procedure. Thirdly, we use a deep network to conduct gender grouping. Fourthly, two recently proposed deep networks are modified with depthwise separable convolutions to perform age estimation within the male and female groups. Finally, a fusion procedure is added to further boost age estimation accuracy. In the experiments, we use two benchmark datasets, IMDB-WIKI and MORPH-II, to verify our proposed method, and show a significant performance improvement over two state-of-the-art deep CNN models.

SS06.3 17:00-17:20 SIFT Features and SVM Learning based Sclera Recognition Method with Efficient Sclera Segmentation for Identity Identification

Sheng-Yu He, Chih-Peng Fan*

National Chung Hsing University, Taiwan

In this work, a learning-based sclera recognition design using local features of sclera veins is proposed for identity identification. The proposed system is partitioned into two computation stages. The first stage is pre-processing, which includes pupil location, iris segmentation, sclera segmentation, and sclera vein enhancement. At the second stage, sclera vein features are extracted after image enhancement using the scale-invariant feature transform (SIFT). With the K-means scheme, the proposed design merges similar features to construct a dictionary describing the group features of interest. Next, each sclera image is referenced against the dictionary to obtain a histogram of group features, and the group features are fed into a support vector machine (SVM) to train an identity classifier. Finally, sclera recognition tests are evaluated. On the UBIRISv1 dataset, the experimental results show that the recognition accuracy is nearly 100%.
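
This is the classic bag-of-visual-words recipe; a minimal sketch using OpenCV and scikit-learn (assuming opencv-python >= 4.4 for SIFT, grayscale uint8 inputs, and that every enhanced sclera image yields at least one descriptor; the dictionary size k is illustrative):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()

def desc(img):
    # SIFT descriptors of an enhanced, grayscale sclera image
    return sift.detectAndCompute(img, None)[1]

def fit(train_imgs, labels, k=64):
    # step 1: K-means dictionary over all training descriptors
    dictionary = KMeans(n_clusters=k).fit(
        np.vstack([desc(i) for i in train_imgs]))
    # step 2: each image -> histogram of dictionary-word occurrences
    hist = lambda img: np.bincount(
        dictionary.predict(desc(img)), minlength=k)
    # step 3: SVM identity classifier on the histograms
    clf = SVC(kernel="rbf").fit([hist(i) for i in train_imgs], labels)
    return lambda img: clf.predict([hist(img)])[0]
```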

SS06.4 17:20-17:40 Low Precision Electroencephalogram for Seizure Detection with Convolutional Neural Network

Nhan Truong*, Omid Kavehei

University of Sydney, Australia

Electroencephalogram (EEG) neural activity recording has been widely used for diagnosing and monitoring epileptic patients. Ambulatory epileptic monitoring devices that can detect or even predict seizures play an important role for patients with intractable epilepsy. Though many EEG-based seizure detection algorithms with high accuracy have been proposed in the literature, their hardware implementations are constrained by power consumption. Many commercial non-research EEG monitoring systems sample multiple electrodes at a relatively high rate and transmit the data either via a wire or wirelessly to an external signal processing unit. In this work, we study how reduced sampling precision affects the performance of our machine learning signal processing in seizure detection. To answer this question, we reduce the number of bits (precision) of the analog-to-digital converter (ADC) used in an EEG recorder. The outcome shows that reducing ADC precision down to 6 bits does not significantly reduce the performance of our convolutional neural network in detecting seizure onsets. As an indication of the performance, we achieved an area under the curve (AUC) of more than 92% and above 96% on the Freiburg Hospital and the Boston Children's Hospital-MIT seizure datasets, respectively. A reduction in ADC precision not only contributes to lower energy consumption, particularly if the data has to be transmitted, but also offers improved computational efficiency in terms of memory requirements and circuit area.
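Such a precision sweep can be emulated offline by requantizing already-recorded EEG; a toy NumPy sketch follows (the +/-100 uV input range and channel/window sizes are assumed parameters, not the paper's):

    import numpy as np

    def requantize(eeg_uv, bits, vmin=-100.0, vmax=100.0):
        # Emulate an ADC of the given bit width over an assumed +/-100 uV range.
        step = (vmax - vmin) / (2 ** bits - 1)
        codes = np.round((np.clip(eeg_uv, vmin, vmax) - vmin) / step)
        return codes * step + vmin

    window = 30.0 * np.random.randn(23, 256)   # toy 23-channel EEG window
    window_6bit = requantize(window, bits=6)   # input to the detection CNN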

L9 Lecture Session 9 Hardware Oriented Neural Network Optimization

Wednesday, March 20|16:20-17:40

Ballroom C, 10F

Chair(s):

Youngjoo Lee, POSTECH, Korea

Hai Li, Duke University, USA

L9.1 16:20-16:40 Intelligent Policy Selection for GPU Warp Scheduler

Lih-Yih Chiou*1, Tsung-Han Yang1, Jian-Tang Syu1, Che-Pin Chang1, Yeong-Jar Chang2 1 National Cheng Kung University, Taiwan 2 Industrial Technology Research Institute, Taiwan

The graphics processing unit (GPU) is widely used in applications that require massive computing resources, such as big data, machine learning, and computer vision. As the diversity of applications grows, it becomes difficult for the warp scheduler to maintain the GPU's performance. Most prior studies of warp scheduling are based on static analysis of GPU hardware behavior for certain types of benchmarks. We propose, for the first time to the best of our knowledge, a machine learning approach that intelligently selects a suitable policy for each application at runtime. The simulation results indicate that the proposed approach maintains performance comparable to the best policy across different applications.
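One way to realize such runtime selection is a lightweight classifier that maps profiled kernel features to the best-known scheduling policy. The features and policy labels below are illustrative assumptions, not the paper's setup:

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical per-kernel features: [memory intensity, branch divergence, occupancy]
    X = [[0.8, 0.2, 0.6],
         [0.1, 0.7, 0.9],
         [0.5, 0.5, 0.4]]
    y = ["GTO", "LRR", "CCWS"]   # best policy observed for each training kernel

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(clf.predict([[0.7, 0.3, 0.5]])[0])  # policy chosen for a new kernel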

L9.2 16:40-17:00 SMURFF: a High-Performance Framework for Matrix Factorization

Tom Vander Aa*, Imen Chakroun, Thomas J. Ashby

imec, Belgium

Bayesian Matrix Factorization (BMF) is a powerful technique for recommender systems because it produces good results and is relatively robust against overfitting. Yet BMF is more computationally intensive and thus more challenging to implement for large datasets. In this work we present SMURFF, a high-performance, feature-rich framework for composing and constructing different Bayesian matrix-factorization methods. The framework has been successfully used for large-scale runs of compound-activity prediction. SMURFF is available as open source and can be used both on a supercomputer and on a desktop or laptop machine. Documentation and several examples are provided as Jupyter notebooks using SMURFF's high-level Python API.
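For intuition only (this deliberately does not use SMURFF's actual API), a toy NumPy Gibbs sampler for the underlying Bayesian matrix factorization model, with fixed noise and prior precisions as assumed hyperparameters:

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs_bmf(R, mask, k=8, alpha=2.0, lam=1.0, iters=50):
        # Toy Gibbs sampler for R ~ U @ V.T with N(0, 1/lam) priors on the
        # factors and noise precision alpha; mask marks observed entries of R.
        n, m = R.shape
        U = 0.1 * rng.standard_normal((n, k))
        V = 0.1 * rng.standard_normal((m, k))
        pred = np.zeros_like(R, dtype=float)
        for _ in range(iters):
            for F, G, M, X in ((U, V, mask, R), (V, U, mask.T, R.T)):
                for i in range(F.shape[0]):
                    Gi = G[M[i]]                      # factors of observed partners
                    cov = np.linalg.inv(alpha * Gi.T @ Gi + lam * np.eye(k))
                    mean = alpha * cov @ Gi.T @ X[i, M[i]]
                    F[i] = rng.multivariate_normal(mean, cov)
            pred += U @ V.T                           # accumulate posterior mean
        return pred / iters

    # Example observation pattern: mask = rng.random(R.shape) < 0.3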

L9.3 17:00-17:20 Spatial Data Dependence Graph Simulator for Convolutional Neural Network Accelerators

Jooho Wang1, Jiwon Kim1, Sungmin Moon1, Sunwoo Kim1, Sungkyung Park2, Chester Sungchung Park*1 1 Konkuk University 2 Pusan National University

A spatial data dependence graph (S-DDG) is proposed as a new model of accelerator dataflow. A pre-RTL simulator based on the S-DDG helps explore the design space in the early design phase. The simulation results show the impact of memory latency and bandwidth on a convolutional neural network (CNN) accelerator.
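In spirit, such a pre-RTL simulator propagates latencies through the dependence graph; a toy sketch with assumed node latencies (the graph fragment and cycle counts below are invented for illustration):

    # Hypothetical S-DDG fragment: node -> (latency in cycles, dependencies).
    SDDG = {
        "load_ifmap":  (100, []),                  # DRAM read
        "load_weight": (100, []),
        "mac_tile":    (16,  ["load_ifmap", "load_weight"]),
        "store_ofmap": (100, ["mac_tile"]),        # DRAM write
    }

    def finish_times(graph):
        # Earliest finish of each node = own latency + slowest dependency.
        done = {}
        def visit(node):
            if node not in done:
                lat, deps = graph[node]
                done[node] = lat + max((visit(d) for d in deps), default=0)
            return done[node]
        for node in graph:
            visit(node)
        return done

    print(finish_times(SDDG))  # raising the DRAM latency shifts the whole schedule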

L9.4 17:20-17:40 AIP: Saving the DRAM Access Energy of CNNs Using Approximate Inner Products

Cheng-Hsuan Cheng, Ren-Shuo Liu*

National Tsing Hua University, Taiwan

In this work, we propose AIP (Approximate Inner Product), which approximates the inner products of CNNs' fully-connected (FC) layers using only a small fraction (e.g., one-sixteenth) of the parameters. We observe that FC layers possess several characteristics that naturally fit AIP: the dropout training strategy, rectified linear units (ReLUs), and the top-n operator. Experimental results show that DRAM access energy can be reduced by 48% at the cost of only a 2% top-5 accuracy loss (for VGG-f).
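A possible reading of the idea in NumPy: score every output neuron with a strided one-sixteenth subset of its weights, then spend full inner products only on the approximate top-n candidates. The strided subsampling pattern and layer sizes are assumptions for the sketch:

    import numpy as np

    def aip_fc(x, W, keep=16, n=5):
        # Approximate scores from every keep-th weight, rescaled to full range.
        idx = np.arange(0, W.shape[1], keep)
        approx = (x[idx] @ W[:, idx].T) * keep
        top = np.argsort(approx)[-n:]             # candidates for the top-n operator
        out = np.zeros(W.shape[0])
        out[top] = np.maximum(x @ W[top].T, 0.0)  # exact inner product + ReLU
        return out

    W = np.random.randn(1000, 4096)               # toy FC layer (out x in)
    y = aip_fc(np.random.randn(4096), W)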

Live Demo & Showcase

◼ Live Demo

Date/Time: March 20, 2019 | 13:10-14:40

Venue: Ballroom A, 10F

Live Demo award is kindly sponsored by the European Union funded NEUROTECH Coordination and Support Action.

No. Topic/Authors

D1 Artificial Intelligence of Things Wearable System for Cardiac Disease Detection

Yu-Jin Lin1, Chen-Wei Chuang1, Chun-Yueh Yen1, Sheng-Hsin Huang1, Peng-Wei Huang1, Ju-Yi Chen2, Shuenn-Yuh Lee*1 1Department of Electrical Engineering, National Cheng Kung University, Taiwan 2Division of Cardiology, Department of Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Taiwan

D2 Multi-task ADAS system on FPGA

Jinzhang Peng1, Lu Tian*2,1, Xijie Jia1, Haotian Guo1, Yongsheng Xu1, Dongliang Xie1, Hong Luo1, Yi Shan1, Yu Wang2 1Xilinx, Inc. 2Department of Electronic Engineering, Tsinghua University

D3 A Deep Learning Based Wearable Medicines Recognition System for Visually Impaired People

Wan-Jung Chang1,2, Yue-Xun Yu1, Jhen-Hao Chen1, Zhi-Yao Zhang1, Sung-Jie Ko1, Tsung-Han Yang1, Chia-Hao Hsu1,2, Liang-Bi Chen*1,2, Ming-Che Chen2,1 1Southern Taiwan University of Science and Technology, Taiwan 2Artificial Intelligence over Internet of Things Applied Research Center (AIoT Center), Southern Taiwan University of Science and Technology, Taiwan

D4 Flyintel – a Platform for Robot Navigation based on a Brain-Inspired Spiking Neural Network

Huang-Yu Yao*, Hsuan-Pei Huang, Yu-Chi Huang, Chung-Chuan Lo

National Tsing Hua University, Taiwan

D5 A Learnable Unmanned Smart Logistics Prototype System Design and Implementation

I-Lok Cheng1, Ching-Hwa Cheng*2, Don-Gey Liu2 1GMT Global Inc. 2Department of Electronics, Feng Chia University, Taiwan

D6 Low Precision Electroencephalogram for Seizure Detection with Convolutional Neural Network

Nhan Truong*, Omid Kavehei

University of Sydney, Australia

D7 SMURFF: a High-Performance Framework for Matrix Factorization

Tom Vander Aa*, Imen Chakroun, Thomas J. Ashby

imec, Belgium

◼ Showcase (Held by Semiconductor Moonshot Project)

Date/Time: March 20, 2019 | 13:10-14:40

Venue: Ballroom A, 10F

No. Topic / Principal Investigator / Affiliation

SC1 Embedded deep learning technology for ADAS applications

Prof. Jiun-In Guo, iVSLab, Institute of Electronics, National Chiao Tung University

SC2 Portable and wireless urine detection system and platform for prevention of cardiovascular disease

Prof. Shuenn-Yuh Lee, National Cheng Kung University / Guard Patch Alliance

SC3 A Fully Convolutional Neural Network for Real-Time Semantic Segmentation of High-Resolution Video (1k x 2k @ 60 fps)

Prof. Youn-Long Lin, National Tsing Hua University

SC4 Developments and applications of an intelligent autonomous mover for environments with surrounding crowds

Prof. Yin-Tsung Hwang, National Chung Hsing University / Feng Chia University / National Formosa University

SC5 Artificial Intelligent 3D Sensing Image Processing System for Array Sensing Lidar

Prof. Yuan-Hao Huang, Department of Electrical Engineering, Department of Computer Science, National Tsing Hua University, Taiwan

SC6 Enabling A Powerful Worldwide Care Intelligent Cloud With Spectrochip

Prof. Weileun Fang, NTUST/NTHU, SpectroChip

SC7 Enabling Technology of Object Recognition and Tracking for Mobile Devices – Towards a Neuromorphic Intelligent Vision System

Prof. Kea-Tiong Tang, National Tsing Hua University

SC8 Micro Darknet for Inference and CASLab SIMT GPU

Prof. Chung-Ho Chen, National Cheng Kung University, Department of Electrical Engineering, Computer Architecture and System Laboratory (NCKUEE CASLab)