Page 1: AI Virtual Tech Talks Series Speech recognition on Arm

AI Virtual Tech Talks Series

Vikrant Singh Tomar and Sam Myer

July 2020

Speech recognition on Arm Cortex-M


Page 2: AI Virtual Tech Talks Series Speech recognition on Arm

AI Virtual Tech Talks Series

Date | Title | Host
Today | Speech recognition on Arm Cortex-M | Fluent.ai
August 11 | Getting started with Arm Cortex-M software development and Arm Development Studio | Arm
August 25 | Efficient ML across Arm from Cortex-M to Web Assembly | Edge Impulse
September 8, 2020 | Running Accelerated ML Applications on Mobile and Embedded Devices using Arm NN | Arm
September 22, 2020 | How To Reduce AI Bias with Synthetic Data for Edge Applications | Dori AI

Visit: developer.arm.com/solutions/machine-learning-on-arm/ai-virtual-tech-talks

Page 3: AI Virtual Tech Talks Series Speech recognition on Arm

Join Us at Arm DevSummit

October 6-8, 2020, Virtual Conference

devsummit.arm.com/teamarm

Page 4: AI Virtual Tech Talks Series Speech recognition on Arm

Speakers


Vikrant Singh Tomar Sam Myer

Page 5: AI Virtual Tech Talks Series Speech recognition on Arm

Overview
• About Fluent.ai
• Fluent.ai µCore
• Demos

Page 6: AI Virtual Tech Talks Series Speech recognition on Arm

About Fluent.ai
• Founded in 2015 after over 7 years of ground-breaking machine learning/AI research by international thought-leaders
• Research partnerships with many leading research labs and institutions
• Strong and experienced team of leading scientists, engineers, sales staff and managers/executives (~25)
• Working with customers in North America, Europe and Asia (robust, multilingual and off-line).

Strong Institutional Backers

Page 7: AI Virtual Tech Talks Series Speech recognition on Arm

Fluent.ai Technology

[Figure: example utterances ("Please turn on the lights", "Lights on", "Lights please") processed two ways. Conventional approaches: speech to text followed by NLP. Fluent.ai: end-to-end spoken language understanding (SLU).]

End to end SLU
• Smaller training data needs
• Higher accuracy and robustness against noise
• Offline and personalizable
• Any language, multi-language

Page 8: AI Virtual Tech Talks Series Speech recognition on Arm

Our Right to Win

“Fluent’s unique models can be trained quickly to deliver the required accuracy in many dialects, languages and noise conditions and be embedded on the world’s devices.”
- William Tunstall-Pedoe, Founder of Evi (acquired by Amazon Alexa), Advisor/Investor in Fluent.ai

Large Global Opportunity & Demand

Page 9: AI Virtual Tech Talks Series Speech recognition on Arm

• Wake up your voice-enabled device with one of our low-power keyword spotting solutions that beat state-of-the-art systems.
• Less than 5% false rejection rate (FRR) at 3 false accepts (FAs) per 24 hours

Single, Multiple, User-Trainable

System Req. | RAM | Storage | Latency | Min. Freq.
Arm Cortex-M4 | 25 KB | 200 KB | 100 ms | 48 MHz

Smart-home Devices, Smart Toys, Wearables, Car Infotainment, Robotics & Industrial IoT, Voice Remotes
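The operating point quoted on this slide (FRR below 5% at 3 FAs per 24 hours) combines two separate measurements. The sketch below shows one common way such figures are computed from evaluation counts. The function names and all counts are hypothetical, chosen only for illustration; the slides do not describe Fluent.ai's evaluation procedure.

```python
def false_rejection_rate(missed: int, total_wake_utterances: int) -> float:
    """Fraction of genuine wake-word utterances the detector failed to accept."""
    if total_wake_utterances <= 0:
        raise ValueError("need at least one wake-word utterance")
    return missed / total_wake_utterances


def false_accepts_per_24h(false_accepts: int, negative_audio_hours: float) -> float:
    """False triggers normalised to a 24-hour span of audio without the wake word."""
    if negative_audio_hours <= 0:
        raise ValueError("need a positive amount of negative audio")
    return false_accepts * 24.0 / negative_audio_hours


# Hypothetical evaluation counts, not taken from the talk.
frr = false_rejection_rate(missed=12, total_wake_utterances=300)
fa_rate = false_accepts_per_24h(false_accepts=5, negative_audio_hours=48.0)

# Target quoted on the slide: FRR < 5% at 3 FAs / 24 hrs.
meets_target = frr < 0.05 and fa_rate <= 3.0
print(f"FRR = {frr:.1%}, FAs/24h = {fa_rate:.2f}, meets target: {meets_target}")
```

The two numbers trade off against each other through the detection threshold, which is why the slide reports FRR at a fixed false-accept rate rather than either figure alone.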

Page 10: AI Virtual Tech Talks Series Speech recognition on Arm

Demo: Multiple wake words (WWs) on NXP LPC55S69 running Arm Cortex-M33
• Arm Cortex-M33 microcontroller running at up to 150 MHz

Page 11: AI Virtual Tech Talks Series Speech recognition on Arm

Voice AI for on-device speech understanding

• Offline/on-device for guaranteed privacy and security
• Any language and accent, multiple languages concurrently
• Personalizable by the end-user
• Lower development cost, faster Time-to-Market

World’s first end-to-end spoken language understanding system => faster, more flexible and more accurate voice user interfaces than conventional technologies.

System Req. | RAM | Storage | Latency | Min. Freq.
Arm Cortex-M4 | 100 KB | 550 KB | < 200 ms | 48 MHz

Smart Speakers, Smart Home Hubs, Wearables, Voice Remotes, Car Infotainment

Page 12: AI Virtual Tech Talks Series Speech recognition on Arm

Demo: Multiple wake words (WWs) and multilingual intent on Arm Cortex-M4
• Arm Cortex-M4 microcontroller running at up to 100 MHz.

bit.ly/fluent_ww_air_cortexm4

Page 13: AI Virtual Tech Talks Series Speech recognition on Arm

Community contributions
• Fluent Speech Commands Dataset
    • Speech to intent dataset
    • ~28,000 utterances from ~100 speakers
    • 31 intents, 254 commands
    • Link: bit.ly/fluent-speech-commands
    • Downloaded over 500 times. Many research papers.
• SpeechBrain Project at MILA
    • A PyTorch based speech toolkit

Page 14: AI Virtual Tech Talks Series Speech recognition on Arm

Fluent.ai µCore


Page 15: AI Virtual Tech Talks Series Speech recognition on Arm

Fluent.ai µCore
• Proprietary low-resource spoken language understanding library
• Detects wakephrase(s) or keywords + commands
• Optimized for Arm Cortex-M series, e.g., M4, M33, M7

Page 16: AI Virtual Tech Talks Series Speech recognition on Arm

Challenges
• Taking neural networks from GPUs to MCUs:
    • Low-footprint (memory and CPU usage)
    • Real-time processing for low latency recognition
• Cross-platform & modular
• Uses Arm CMSIS-DSP & CMSIS-NN
    • Take advantage of assembly optimizations on Arm platforms
    • Liberal Apache 2.0 license

[Diagram labels: Fluent.ai µCore, Fluent.ai Transformer]

Page 17: AI Virtual Tech Talks Series Speech recognition on Arm

Transforming models
• Trained model on GPU using PyTorch
• Perform post-processing
• Generate C++ code describing model
• Compile model C++ code with library

• Generates C++ code
• Conditional compilation
• 8-bit quantization
• Weight reordering

Fluent.ai µCore
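To make the "8-bit quantization" and "Generate C++ code describing model" steps concrete, here is a minimal sketch of symmetric per-tensor int8 quantization followed by emitting the result as a C++ array. This is an assumption about how such a step could look, not Fluent.ai's actual tool: the slides do not say which quantization scheme is used, and the names `quantize_int8`, `emit_cpp_array` and `layer0_weights` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: each w is approximated by q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def emit_cpp_array(name: str, q: List[int], scale: float) -> str:
    """Render quantized weights as a constant C++ array definition."""
    values = ", ".join(str(v) for v in q)
    return (
        f"// scale = {scale!r}\n"
        f"static const int8_t {name}[{len(q)}] = {{{values}}};\n"
    )


weights = [0.6, -1.0, 0.25, 0.0]  # hypothetical layer weights
q, scale = quantize_int8(weights)
cpp_source = emit_cpp_array("layer0_weights", q, scale)
print(cpp_source)
```

Declaring the array `const` lets a typical embedded toolchain place it in flash rather than RAM, which fits the later slide stating that all NN weights are stored in flash.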

Page 18: AI Virtual Tech Talks Series Speech recognition on Arm

Real-time processing
Convolution in existing libraries designed for images, not time-series

Training (batched utterances) | Inference (streaming audio)
Entire utterance is available | Audio streamed one frame at a time
Decoding latency not considered | Latency minimized for good user experience
Finite utterance length | Continuously listening, NN applied in overlapping windows
Activations for entire network stored in memory | Memory usage must be minimized

Page 19: AI Virtual Tech Talks Series Speech recognition on Arm

Layer types
• Streaming layer types
    • Unidirectional recurrent layers (GRU, LSTM)
    • Convolution / depthwise-separable
    • Windowed functions (e.g. MaxPool)
    • Streaming cumulative functions (e.g. GlobalMaxPool)
    • Skip connections
    • Activation functions (ReLU, sigmoid, tanh)
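The difference between a windowed function and a streaming cumulative function can be sketched as follows. Both classes are illustrative assumptions rather than µCore code; the `process` method name follows the "Process function" mentioned on the next slide, and the rest of the interface is hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional

Frame = List[float]


class StreamingMaxPool:
    """Windowed max over the last `size` frames, emitted every `stride` frames."""

    def __init__(self, size: int, stride: int) -> None:
        self.stride = stride
        self.buffer: Deque[Frame] = deque(maxlen=size)
        self.until_next = size  # frames still needed before the next output

    def process(self, frame: Frame) -> Optional[Frame]:
        self.buffer.append(frame)
        self.until_next -= 1
        if self.until_next > 0:
            return None
        self.until_next = self.stride
        # Element-wise max across the buffered frames.
        return [max(column) for column in zip(*self.buffer)]


class StreamingGlobalMaxPool:
    """Running max over every frame seen so far; the state is a single frame."""

    def __init__(self) -> None:
        self.running: Optional[Frame] = None

    def process(self, frame: Frame) -> Frame:
        if self.running is None:
            self.running = list(frame)
        else:
            self.running = [max(a, b) for a, b in zip(self.running, frame)]
        return list(self.running)


frames = [[1.0, 5.0], [3.0, 2.0], [0.0, 9.0], [4.0, 1.0]]
window_pool = StreamingMaxPool(size=2, stride=2)
global_pool = StreamingGlobalMaxPool()
windowed = [window_pool.process(f) for f in frames]
cumulative = [global_pool.process(f) for f in frames]
```

The windowed layer needs a buffer of `size` frames, while the cumulative layer keeps only one frame of state regardless of how long the audio runs.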

Page 20: AI Virtual Tech Talks Series Speech recognition on Arm

NN Layer structure
• All NN weights are stored in Flash
    • Arm MCU platform allows network weights to be fetched layer by layer
• Only activation buffer is stored in RAM
• Process function
    • Uses CMSIS
    • Calculates activations as data is received and updates buffer
    • 1 frame input/output

Page 21: AI Virtual Tech Talks Series Speech recognition on Arm

Sequence of layers
• Layers joined in sequence
• Input frame propagates through layers
• Layers are independent
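A sequence of independent streaming layers, each with a one-frame process function, could be organised roughly as below. This is a sketch under assumptions: the `Layer` protocol, `LayerSequence`, and the two toy layers are hypothetical names, and the convention that a layer returns `None` when it has no output yet is borrowed from the NULL outputs shown in the convolution example slides.

```python
from typing import List, Optional, Protocol

Frame = List[float]


class Layer(Protocol):
    def process(self, frame: Frame) -> Optional[Frame]:
        """Consume one frame; return one output frame, or None if not ready."""
        ...


class ReLU:
    """Stateless activation layer."""

    def process(self, frame: Frame) -> Optional[Frame]:
        return [max(0.0, x) for x in frame]


class Decimate:
    """Toy stateful layer: passes every `step`-th frame and drops the rest."""

    def __init__(self, step: int) -> None:
        self.step = step
        self.count = 0

    def process(self, frame: Frame) -> Optional[Frame]:
        self.count += 1
        return frame if self.count % self.step == 0 else None


class LayerSequence:
    """Pushes one input frame through the layers in order."""

    def __init__(self, layers: List[Layer]) -> None:
        self.layers = layers

    def process(self, frame: Frame) -> Optional[Frame]:
        current: Optional[Frame] = frame
        for layer in self.layers:
            current = layer.process(current)
            if current is None:
                return None  # downstream layers simply wait for more input
        return current


net = LayerSequence([ReLU(), Decimate(step=2), ReLU()])
inputs = [[-1.0, 2.0], [3.0, -4.0], [5.0, 6.0], [-7.0, 8.0]]
outputs = [net.process(f) for f in inputs]
```

Because each layer owns its own state and sees only one frame at a time, layers can be developed and tested in isolation, which is one reading of "Layers are independent".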

Page 22: AI Virtual Tech Talks Series Speech recognition on Arm

Convolution example
Streaming convolution, kernel size = 3, stride = 2

[Animation across slides 22 to 27: input frames enter a Buffer, and the Process function turns the buffer contents into an Output. Frame 1: output is NULL. Frame 2: output is NULL. Frame 3: an output is produced. Frame 4: output is NULL.]
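The animation above can be reproduced with a small streaming 1-D convolution that buffers the last three frames and emits an output every second frame once the buffer is full. This is a minimal sketch, not the µCore implementation (which the slides say uses CMSIS); the class name, the weight layout and the all-ones kernel are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Optional

Frame = List[float]


class StreamingConv1d:
    """1-D convolution over time, fed one frame at a time.

    weights[o][k][i] multiplies input feature i of the k-th buffered frame
    (oldest first) when computing output channel o.
    """

    def __init__(self, weights: List[List[List[float]]], stride: int) -> None:
        self.weights = weights
        self.kernel_size = len(weights[0])
        self.stride = stride
        self.buffer: Deque[Frame] = deque(maxlen=self.kernel_size)
        self.until_next = self.kernel_size  # frames needed before first output

    def process(self, frame: Frame) -> Optional[Frame]:
        self.buffer.append(frame)  # the oldest frame drops out automatically
        self.until_next -= 1
        if self.until_next > 0:
            return None  # corresponds to NULL on the slides
        self.until_next = self.stride
        return [
            sum(
                w * x
                for taps, buffered in zip(kernel, self.buffer)
                for w, x in zip(taps, buffered)
            )
            for kernel in self.weights
        ]


# One output channel, one input feature, kernel size 3, all-ones weights.
conv = StreamingConv1d(weights=[[[1.0], [1.0], [1.0]]], stride=2)
results = [conv.process([float(t)]) for t in range(1, 6)]
```

With frames 1 to 5 as input, the layer returns nothing for frames 1, 2 and 4 and produces outputs at frames 3 and 5, matching the NULL pattern in the slides. Only three frames are ever held in the buffer.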

Page 28: AI Virtual Tech Talks Series Speech recognition on Arm

Advantages of streaming NN
• No need to keep features/activations for entire utterance
    => lower memory requirements
• Live processing while the user is speaking
    => lower latency
• Redundant calculations eliminated by not using overlapping windows
    => lower CPU usage
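A rough count shows where the redundant calculations in the last point come from. When a network is re-run on overlapping windows, each input frame is processed once per window that contains it, whereas a streaming network sees each frame once. The sketch below counts frame evaluations only, ignoring per-layer costs; the frame length, window and hop sizes are hypothetical and not taken from the talk.

```python
def frame_evaluations_windowed(total_frames: int, window: int, hop: int) -> int:
    """Frames pushed through the network when it is re-run on every window."""
    if total_frames < window:
        return 0
    num_windows = (total_frames - window) // hop + 1
    return num_windows * window


def frame_evaluations_streaming(total_frames: int) -> int:
    """A streaming network processes each incoming frame exactly once."""
    return total_frames


# Hypothetical setup: 10 ms frames, 1 s window (100 frames),
# a new decision every 100 ms (hop of 10 frames), one minute of audio.
total = 6000
windowed_cost = frame_evaluations_windowed(total, window=100, hop=10)
streaming_cost = frame_evaluations_streaming(total)
print(f"windowed: {windowed_cost}, streaming: {streaming_cost}, "
      f"ratio: {windowed_cost / streaming_cost:.2f}")
```

In this simplified model the overhead approaches window/hop as the recording gets longer, so shortening the output interval of a windowed system raises its CPU cost, while a streaming system's cost stays the same.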

Page 29: AI Virtual Tech Talks Series Speech recognition on Arm

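µCore vs tflite-micro
• Same Fluent wakeword model running on µCore and tflite on a Linux machine

[Bar chart comparing three configurations (Fluent, tflite-micro win=1.16s, tflite-micro win=1.48s) on four metrics: Tensor RAM (kB), Decoding time (ms), MIPS (MHz/s), Output interval (ms). Data labels as extracted, with their assignment to bars not recoverable from this transcript: 26.5, 53, 48, 80, 250, 770, 692, 80, 300, 163, 147, 400.]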

Page 30: AI Virtual Tech Talks Series Speech recognition on Arm

µCore Summary
• Low footprint
    • CPU efficient code, reduced model size & memory load operations
    • small code size, to fit into limited flash memory
    • small memory footprint (RAM)
    • e.g., Arm Cortex-M4, M33 @ 48 MHz
• Streaming NN: optimized for low latency / real-time operations
• Cross platform

Cheaper yet effective device designs!

Page 31: AI Virtual Tech Talks Series Speech recognition on Arm

More demos

Page 32: AI Virtual Tech Talks Series Speech recognition on Arm

Demo: Smart-home voice control on Cortex M7


Page 33: AI Virtual Tech Talks Series Speech recognition on Arm

Thank You / Danke / Merci / 谢谢 / ありがとう / Gracias / Kiitos / 감사합니다 / धन्यवाद / شكرًا / תודה