boosting speech recognition performance by 5x - intel€¦ · boosting speech recognition...

Case Study

Boosting Speech RecognitionPerformance by 5x

Qihoo360 Technology Co. Ltd. Optimizes its Applications on Intel® Architecture

The Euler* platform is an important distributed offline computing platform thatsupports machine-learning-related computation models for real businesses. ForQihoo360 Technology Co. Ltd., a Chinese Internet security company, the perform-ance of the Euler platform on Intel® Xeon® processor-based servers is critical. Thecompany collaborated with Intel to optimize its applications on Intel® architecture.Engineers from both companies worked together closely to optimize the speechrecognition module of the Euler platform using Intel® Math Kernel Library (Intel®MKL), speeding up performance by 5x.

Business Requirements

Qihoo360 has about 500 million monthly active Internet users and over 640 mil-lion mobile users. Recognizing security as a fundamental need of all Internet andmobile users, the company built its user base by offering comprehensive, effec-tive, and user-friendly Internet and mobile security products and services to pro-tect users' computers and mobile devices against malware and maliciouswebsites. Besides its 360 Safe Guard* and 360 Internet Security* products,Qihoo360 has also released new products, including 360 Cloud*, 360 Browser*,360 Search*, and 360 Mobile Assistant*.

In the mobile Internet era, with a rapidly growing business and user base,Qihoo360 faces growing pressure to support its increasing user base and busi-ness and reduce the total cost of ownership. In Qihoo360’s data center, mostservices and applications run on the Intel architecture platform. And it has be-come increasingly urgent to improve their performance and make them run effi-ciently on Intel architecture.

Qihoo360 Euler Overview

Qihoo360's Euler platform is a distributed offline computing platform for large-scale data processing that adopts mainly a full in-memory computation model.The Euler platform is designed to efficiently handle computation models that re-quire multiple iterations, such as machine learning. Machine learning is a scientific

Intel® Math Kernel Library, Intel® C++ Compiler

High-Performance Computing

Figure 1. Qihoo360's Euler platform architecture

“We are glad that we engagedwith Intel to apply Intel® MKLand Intel® C/C++ Compiler toimprove the computing speedof our Euler ASR module onIntel® architecture. This willhelp us to maintain our tech

leadership in the industry andto grasp new opportunities to

grow our business.”

—Dr. Bai MingArchitect

Qihoo360Technology Co. Ltd.

Boosting Speech Recognition Performance by 5x 2

discipline that explores the constructionand study of algorithms that can learnfrom data. Such algorithms operate bybuilding a model based on inputs andusing that model to make predictions ordecisions rather than following only ex-plicitly programmed instructions. TheEuler platform is the foundational plat-form of machine learning at Qihoo360,playing a very important role in the com-pany's business. Figure 1 shows theEuler architecture.

The Euler platform integrates numerousfeatures, including:

• Resource management

• Task scheduling

• Data partition

• Parallel computation

• Data dependence

• Network communication

• Data barrier

• Fault-tolerant, distributed framework

• Distributed data structure

Euler provides an API and library (C++only) and framework to Qihoo360's servicedevelopers, who can then focus on ma-chine-learning-related usage development.

The Euler platform serves as the offlinetraining model for Qihoo360's business,including its click-through rate estimatemodel, page ranking, white listing, mo-bile assistant recommendations, imagesimilarity computation in image searches,network attack behavior analysis model,word2vec model, and speech recogni-tion. In the Euler platform, input andoutput (I/O) data are from a distributedfile system (such as HDFS*). Besides this,there is no other I/O overhead, makingthe Euler platform a typical computa-tion-intensive workload.

Improving the computational perform-ance of Euler on Intel Xeon processor-based servers is a big challenge forQihoo360. The speed of training the ma-chine learning model in the Euler plat-form (e.g., speech recognition modeltraining) has a direct impact onQihoo360's business.

Engineers from both Qihoo360 and Intelworked closely together to optimize thekey speech recognition module inQihoo360's Euler platform. Speechrecognition is the translation of spokenwords into text, also known as automaticspeech recognition (ASR). Speech recog-nition is used by Qihoo360's voicesearch service.

Maintaining Tech Leadership and MaximizingNew Growth Opportunities

Linux*, OS X*, and Android*), includingIntel® C/C++ Compiler and Intel® MKL.

Intel® C/C++ Compiler Overview

Intel engineers designed Intel C/C++Compiler for leading performance onIntel and compatible processors. Thecompiler has unique capabilities to takeadvantage of the new multicore proces-sors and multiprocessor systems. It auto-matically generates optimized code forall Intel architecture-based platforms in-cluding automatic vectoring and parallel-ing, memory and cache line tuning, aswell as serious high-level optimization. Itexplores how to complete a task withminimal CPU cycles and is compatiblewith Microsoft Visual C++* on Windows*and GNU Compiler Collection* (GCC*) onLinux*. It also scales forward with multi-core, many-core, and multiprocessor sys-tems with OpenMP, automatic parallelism,and Intel® Xeon Phi™ coprocessor sup-port. All compiler features help develop-ers cut development time, costs, and risk,as well as to achieve the best possibleperformance on Intel architecture.

Intel® MKL OverviewIntel MKL (Figure 3) is a feature-rich,high-performance mathematical libraryfor building scientific, engineering, and fi-nancial applications. It gives developersbetter performance with minimal effort.Intel MKL is optimized for the latest Intel®processors, including Intel's multicoreand many-core processors. It can unleashthe maximum computing and parallelprocessing capability of the Intel Xeonand Xeon Phi product families. (See theIntel MKL release notes for the full list ofsupported processors.)

Intel MKL provides rich functions, includ-ing Basic Linear Algebra Subprograms(BLAS) and Linear Algebra PACKage (LA-PACK) routines, Fast Fourier transforms,vector math functions, random numbergeneration functions, and other function-ality. It also provides interfaces to defacto standard APIs including C++, For-tran, C#, Java, Python, and R.

Figure 2. Large vocabulary speech recognition in the Euler platform

Figure 3. Intel® Math Kernel Library

Table 1. CBLAS functions in DNN modeling

Speech Recognition OptimizationASR is a typical large-scale, compute-in-tensive module in Qihoo360's Euler plat-form. Deep Neural Networks (DNN) wasused for the task. As Figure 2 shows, afterpreprocessing a massive amount ofspeech data, all floating-point data werefed into the DNN. After repeated itera-tions, an acoustic model was established.During processing, there were huge ma-trix and vector operations that became abottleneck for the whole project.

Computation speed became a bottleneckin Qihoo360’s ASR module, but simplyexpanding the server pool was not acost-effective solution. The companyneeded an optimized software solution.

Intel® Parallel Studio XE is a suite of soft-ware development tools that can maxi-mize Intel architecture capabilities andmake it easier to deploy advanced tech-nologies. It includes a full set of high-perfor-mance libraries for all Intel architecture-based client and server platforms formultiple operating systems (e.g., Windows*,

Boosting Speech Recognition Performance by 5x 3

https://software.intel.com/en-us/intel-sdp-home/

https://software.intel.com/en-us/intel-mkl

https://software.intel.com/en-us/articles/intel-mkl-112-release-notes

Intel MKL BLAS library can be integratedwith many third-party libraries includingMATLAB*, NumPy*, Scipy* and R, Ar-madillo*, and Boost* uBLAS.

Intel CC and Intel MKL Based Solution As explained earlier, Qihoo360 needed away to optimize its compute-intensiveapplication. Originally, Qihoo360 usedthe ATLAS CBLAS library as the basiccomponent in its computation layer withGCC as the default compiler. The CBLASlibrary is the C interface of the BLAS li-brary, so it is often used as a basic com-puting building block to support theupper-level application and has a defacto standard API. A developer canchange the CBLAS library to an opti-mized BLAS library without changingany code. Intel MKL provides the BLASlibrary and the CBLAS library, whichhave the same API as a standard CBLASlibrary. The Intel Compiler can extracttop performance on Intel architecture,so Qihoo360's Euler team sought to re-place the GCC compiler and CBLAS li-brary with the Intel Compiler and IntelMKL to build the ASR module.

As Figure 2 shows, DNNs are the mostimportant part of the ASR module. It isconstructed by an extensive linear alge-

Architecturally, the Intel MKL BLAS li-brary is optimized with Single Instruc-tion Multiple Data instructions such asAdvanced Vector Extensions (AVX andAVX2). With multi-threaded features, itcan be used as a performance buildingblock in many math libraries and appli-cations. BLAS itself is a popular mathcomputing routine set and a de factostandard API for linear algebra libraries.It performs common linear algebra op-erations such vector scaling, vector dotproduct, matrix vector multiplication,and matric multiplication (Table 1).

BLAS was first implemented as a Fortranlibrary in 19791 by netlib. There are sev-eral variations and alternative imple-mentations such as CBLAS* (C interfaceto the BLAS), highly optimized imple-mentations such as Intel MKL and AMDACML*, as well as GotoBLAS* andATLAS* (a portable, self-optimizingBLAS). BLAS libraries are widely used asbuilding blocks in high-level math pro-gramming languages and libraries in-cluding LINPACK*, LAPACK*, MATLAB*,GNU Octave*, Mathematica*, NumPy*,and R* and in all kinds of computationapplications. Developers can easily re-place any BLAS library with the IntelMKL BLAS library and speed the BLAScomputation on Intel architecture. The

Boosting Speech Recognition Performance by 5x

bra operation, which is supported by theCBLAS API (Table 1).

We take the matrix multiply functioncblas_sgemm. The code in the left col-umn of Table 2 is the prototype of thefunctionality, while the right column dis-plays the CBLAS implementation.

We built the function cblas_dgemmwith ATLAS CBLAS along with the OS(from the Red Hat yum* repository) andthe Intel MKL CBLAS library (as part ofIntel Parallel Studio XE 2013 SP1). Theoriginal code is unchanged. The onlything we changed was the link lineshown in Table 2, replacing the ATLASCBLAS library with Intel MKL BLAS. Be-cause the Intel MKL is optimized for newAVX instructions and well-tuned formulticore platforms, Intel MKL has over-whelmingly dominant performancecompared to OS-associated ATLASCBLAS, even with the multithreaded ver-sion of libptcblas.so . The first categoryin Figure 4 is the speedup result ofcblas_sgemm using mkl blas compar-ing using the original ATLAS BLAS alongwith the OS. The performance metric isthe speedup. We can see in Figure 4 thatthe Intel MKL BLAS is 7x faster thanoriginal ATLAS BLAS on our test ma-chine (Intel® Xeon® processor E5-26802.70GHz, 2 sockets and 8 cores, AVXsupport, Red Hat Enterprise Linux*Server release 6.3).

Furthermore, since Intel MKL is welltuned for other Intel processors, whenQihoo360 upgrades to a new hardwaresolution like AVX2, it can automaticallytake advantage of the new hardware viathe Intel MKL CBLAS library and keepseeing outstanding performance.

Besides the CBLAS functions, the ASRmodule also includes many vector oper-ations. A common function func_12() is an example. It includes all kinds ofoperations like finding the maximumvalue of the vector, computing the expo-nent of one vector, and summing them.Originally, we tried to build the functionwith GCC and apply some CBLAS func-tions. We got some performance im-provement, but no speedup since most

Table 2. Build function cblas_sgemm with ATLAS BLAS and Intel® MKL

4

Boosting Speech Recognition Performance by 5x

of them are of O(N) computational com-plexity and hard to multi-thread. Becausethe Intel Compiler can automatically gener-ate optimized code, including autovec-toring and parallelizing and memoryand cache line tuning, building with theIntel Compiler improved performance11x. (The second bar in Figure 4 showsthis speedup.)

These test results inspired Qihoo360 toadopt the Intel Compiler and Intel MKLinto the whole module. With the IntelMKL CBLAS library, Intel MKL VectorMath library, Intel Compiler, and someoptimization techniques—includingmemory alignment, CPU KMP affinity,vectorization, and SSE instruction—ASRperformance improved about 5x overthe previous implementation (Figure 4).

SummaryJust replacing the default CBLAS withthe Intel MKL CBLAS library letQihoo360's developers unleash the per-formance of Intel processors. Moreover,Intel MKL is optimized for Intel's existingand future processors, so the applica-tion can help Qihoo360 automaticallygain performance on a new processorwithout any code changes. At the sametime, the Intel Compiler can help devel-opers further improve performance easilyand flexibly. With these tools, developerscan achieve great performance speedupwith less effort, improving the efficiencyof their application development, savingmaintenance time, and enjoying betterscalability on Intel architecture.

Qihoo360 is satisfied with the resultsand plans to promote the solution withfuture development on other compute-intensive modules.

Table 3. Build function func_12 with GCC* and Intel® C++ Compiler

Figure 4. ASM optimization results

Download the free CommunityLicensed Version of Intel® MKL >

5

3

33

http://software.intel.com/free_tools_and_libraries?utm_campaign=CMD&utm_source=PUM23&utm_medium=article&utm_content=Qihoo%20article

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at www.intel.com. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured usingspecific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performancetests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/perfor-mance. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sitesor others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. This document and the information given are for the convenience of Intel’s customer base and are provided “AS IS” WITH NO WARRANTIES WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING ANY IM-PLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS. Receipt or possession of this document does notgrant any license to any of the intellectual property described, displayed, or contained herein. Intel® products are not intended for use in medical, lifesaving, life-sustaining, critical control, or safety sys-tems, or in nuclear facility applications. Copyright © 2015 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Printed in USA 1015/SS Please Recycle

boosting speech recognition performance by 5x - intel€¦ · boosting speech recognition...

Documents