HANA SPS07 Text Analysis

What's New? SAP HANA SPS 07 Text Analysis (Delta from SPS 06 to SPS 07)
SAP HANA Product Management, November 2013

Post on 30-Jun-2015

Page 1: HANA SPS07 Text Analysis

What's New? SAP HANA SPS 07 Text Analysis (Delta from SPS 06 to SPS 07)
SAP HANA Product Management, November 2013

Page 2: HANA SPS07 Text Analysis

© 2013 SAP AG. All rights reserved. 2Public

Agenda

New or Improved Text Analysis Features
– Custom dictionaries
– Custom configurations
– Indexing throughput

Improved Language Coverage
– Social Media extraction for Japanese & Simplified Chinese
– Numerical extraction for Simplified Chinese
– Core extraction for Russian
– Voice of Customer for Simplified Chinese

Related Topics
– Fulltext search
– Fuzzy search

Page 3: HANA SPS07 Text Analysis

New or Improved Text Analysis Features

Page 4: HANA SPS07 Text Analysis

New Custom Dictionary Support

You can now specify your own entity types and names to be used with text analysis, which may be critical for particular industries or data domains.
– A single custom dictionary may support all languages or a single language
– Custom dictionaries reside in the HANA repository and benefit from its life cycle management

Steps
1. Choose the project to contain the new dictionary in the Development perspective of SAP HANA Studio.
2. Enter or select a parent folder and enter the dictionary file name in the Wizard. Your text analysis dictionary file is created locally and opens as an empty file in the text editor.
3. Enter your text analysis dictionary specification into the new file and save it locally.
4. Commit your new dictionary. The dictionary is now synchronized to the repository as a design time object and the icon shows the dictionary is committed.
5. Activate once you have finished editing your dictionary. The dictionary is created in the repository as a runtime object and the icon shows the dictionary is activated. This allows you and others to use the dictionary.

If you haven’t done so previously, you will need to create a custom text analysis configuration as well…
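To make the idea of a custom dictionary concrete, here is a minimal Python sketch. It is not SAP's dictionary file format or its matching engine; it only assumes that a dictionary conceptually maps an entity type to standard names and their variants. The entity type CHEMICAL_COMPOUND, the sample names, and the helper functions are all hypothetical.

```python
import re

# Hypothetical dictionary: entity type -> standard form -> list of variants.
CUSTOM_DICTIONARY = {
    "CHEMICAL_COMPOUND": {
        "sodium chloride": ["table salt", "NaCl"],
        "acetylsalicylic acid": ["aspirin", "ASA"],
    },
}


def build_lookup(dictionary):
    """Flatten to {lowercased surface form: (entity type, standard form)}."""
    lookup = {}
    for entity_type, names in dictionary.items():
        for standard_form, variants in names.items():
            for surface in [standard_form, *variants]:
                lookup[surface.lower()] = (entity_type, standard_form)
    return lookup


def find_entities(text, dictionary):
    """Return (offset, matched text, entity type, standard form) tuples."""
    lookup = build_lookup(dictionary)
    # Try longer surface forms first so multi-word names win over shorter ones.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.group(0), *lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Calling find_entities("Dissolve NaCl, then add aspirin.", CUSTOM_DICTIONARY) reports both variants under their standard forms, which is the normalisation a domain dictionary is meant to provide.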

Page 5: HANA SPS07 Text Analysis

New Custom Configuration Support

You can now customize the features and options used for text analysis rather than using the predefined configurations:
– LINGANALYSIS_BASIC
– LINGANALYSIS_STEMS
– LINGANALYSIS_FULL
– EXTRACTION_CORE
– EXTRACTION_CORE_VOICEOFCUSTOMER

Custom configurations allow you to suppress the default output and incorporate custom dictionaries. You can either:
– Create a new XML configuration file within SAP HANA Studio
– Copy one of the predefined configurations and modify it
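A configuration, predefined or custom, is typically referenced by name when a full-text index is created. The Python sketch below only assembles such a statement as a string. The statement shape (CREATE FULLTEXT INDEX ... CONFIGURATION '...' TEXT ANALYSIS ON) should be checked against the SAP HANA SQL reference for your revision, and the index, table, column, and custom configuration names are hypothetical; in particular, the 'package::name' form used for the custom configuration is an assumption.

```python
PREDEFINED_CONFIGURATIONS = (
    "LINGANALYSIS_BASIC",
    "LINGANALYSIS_STEMS",
    "LINGANALYSIS_FULL",
    "EXTRACTION_CORE",
    "EXTRACTION_CORE_VOICEOFCUSTOMER",
)


def is_predefined(configuration):
    """True if the name is one of the configurations listed on the slide."""
    return configuration in PREDEFINED_CONFIGURATIONS


def fulltext_index_sql(index_name, table, column, configuration):
    """Build a CREATE FULLTEXT INDEX statement with text analysis enabled."""
    for value in (index_name, table, column, configuration):
        # Refuse quote characters rather than attempt escaping in a sketch.
        if '"' in value or "'" in value:
            raise ValueError(f"unsupported character in {value!r}")
    return (
        f'CREATE FULLTEXT INDEX "{index_name}" ON "{table}"("{column}") '
        f"CONFIGURATION '{configuration}' TEXT ANALYSIS ON"
    )


# Hypothetical names: a predefined configuration and a custom one.
print(fulltext_index_sql("REVIEWS_TA", "REVIEWS", "BODY", "EXTRACTION_CORE"))
print(fulltext_index_sql("REVIEWS_TA", "REVIEWS", "BODY", "acme.ta::MY_CONFIG"))
```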

Page 6: HANA SPS07 Text Analysis

Greater Indexing Throughput

Improved scalability of the highlighted preprocessing steps:
– File filtering: converting binary document formats to text/HTML
– Tokenization: decompose word sequence, e.g. “the quick brown fox” -> “the” “quick” “brown” “fox”
– Stemming: reduction of tokens to linguistic base form, e.g. houses -> house; ran -> run
– Linguistic analysis: part-of-speech identification, e.g. quick: Adjective; houses: Plural Noun
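The three linguistic steps can be illustrated with a toy Python sketch that reproduces only the examples given on the slide. The lookup tables and function names are hypothetical stand-ins; a real engine relies on full lexicons and language-specific rules rather than two-entry dictionaries.

```python
import re

# Toy lookup tables holding just the slide's examples.
STEMS = {"houses": "house", "ran": "run"}
PARTS_OF_SPEECH = {"quick": "Adjective", "houses": "Plural Noun"}


def tokenize(text):
    """Decompose a word sequence into individual tokens."""
    return re.findall(r"\w+", text)


def stem(token):
    """Reduce a token to its base form; unknown tokens are only lowercased."""
    return STEMS.get(token.lower(), token.lower())


def part_of_speech(token):
    """Look up the part of speech; tokens outside the toy table are 'Unknown'."""
    return PARTS_OF_SPEECH.get(token.lower(), "Unknown")


for token in tokenize("the quick brown fox"):
    print(token, stem(token), part_of_speech(token))
```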

– Utilizes more threads and efficient data transfers
– Applies to all text analysis configurations

50% greater throughput (depending upon hardware configuration)

30% less time (depending upon hardware configuration)

Page 7: HANA SPS07 Text Analysis

Improved Language Coverage

Page 8: HANA SPS07 Text Analysis

Available Text Analysis Configuration Options

Language | LINGANALYSIS_BASIC | LINGANALYSIS_STEMS | LINGANALYSIS_FULL | EXTRACTION_CORE | EXTRACTION_CORE_VOICEOFCUSTOMER

Arabic X

Catalan X X

Chinese (Simplified) IMPROVED IMPROVED

Chinese (Traditional) X X

Croatian X X

Czech X X

Danish X X

Dutch X

English

Farsi X

French

German

Greek X X X

Hebrew X X X

Hungarian X X X

Italian X

Japanese IMPROVED X

Korean X

Norwegian (Bokmal) X X

Norwegian (Nynorsk) X X

Polish X X X

Portuguese X

Romanian X X X

Russian IMPROVED X

Serbian X X

Slovak X X

Slovenian X X

Spanish

Swedish X X

Thai X X X

Turkish X X X

Page 9: HANA SPS07 Text Analysis

Improved Social Media Extraction for Japanese & Simplified Chinese

– Identifies with high recall and precision SOCIAL_MEDIA entities with corresponding offsets
– Tags SOCIAL_MEDIA entities such as IDs (@MyTwitterName) or topics (#MyWeiboKeyword)
– Distinguishes between SOCIAL_MEDIA entities and emoticons like @__@
– Distinguishes between SOCIAL_MEDIA entities and emails like [email protected]
– Respects important Weibo and Twitter differences, e.g. #W-TOPIC# vs. #T-TOPIC1 #T-TOPIC2
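The distinctions listed above can be sketched with regular expressions in Python. This is an illustration only, not how HANA performs the extraction. The patterns are simplified assumptions (ASCII-only user IDs, a rough email shape), the labels "ID" and "TOPIC" and the platform parameter are hypothetical, and the sample email address in the usage line is made up.

```python
import re

# Simplified, assumed patterns; real extraction rules are far richer.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
EMOTICON = r"@_+@"
WEIBO_TOPIC = r"#[^#\n]+#"   # Weibo topics are delimited on both sides
TWITTER_TOPIC = r"#[^\s#]+"  # Twitter topics end at whitespace
USER_ID = r"@[A-Za-z0-9_]+"


def extract_social_media(text, platform="twitter"):
    """Return (offset, matched text, label) for user IDs and topics."""
    topic = WEIBO_TOPIC if platform == "weibo" else TWITTER_TOPIC
    # Emails and emoticons come first so they are consumed before the
    # ID pattern can claim part of them.
    scanner = re.compile(
        f"(?P<email>{EMAIL})|(?P<emoticon>{EMOTICON})"
        f"|(?P<topic>{topic})|(?P<user>{USER_ID})"
    )
    entities = []
    for match in scanner.finditer(text):
        if match.lastgroup in ("email", "emoticon"):
            continue  # recognised only so they are not tagged as entities
        label = "ID" if match.lastgroup == "user" else "TOPIC"
        entities.append((match.start(), match.group(0), label))
    return entities


sample = "Ask @MyTwitterName about #HANA @__@ or mail user@example.com"
print(extract_social_media(sample))
print(extract_social_media("#W-TOPIC# is trending", platform="weibo"))
```

Passing the platform explicitly keeps the sketch honest about the ambiguity: without it, "#T-TOPIC1 #T-TOPIC2" could be read either as two Twitter topics or as one Weibo topic containing a space.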

Page 10: HANA SPS07 Text Analysis

Improved Numerical Extraction for Simplified Chinese

Better identifies numerical entities with special characters:

CURRENCY – expressions denoting amounts of money
– 33.8 万元
– 港币五千万
– 一百四十四亿七千万美元

DATE – minimally composed of a number and month name
– 7 月 2 日
– 十月十七日

MEASURE – expressions
– 二百五十六公斤
– 5.5 米

TIME – clock times and time expressions
– 8 时
– 3 点零 5 分
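Part of what makes these entities harder than their Western counterparts is that quantities are often written with Chinese numerals and with the grouping units 万 (10^4) and 亿 (10^8). The Python sketch below converts such numerals to integers to show the structure involved. It is a simplified illustration with a hypothetical function name, not a description of HANA's extraction logic, and it ignores decimals, Arabic digits, and formal (banking) numerals.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "亿": 10**8}


def chinese_numeral_to_int(text):
    """Convert a plain Chinese numeral such as 二百五十六 to an integer."""
    total = 0    # value already closed off by 万 or 亿
    section = 0  # value of the current group below 10,000
    digit = 0    # most recent digit not yet multiplied by a unit
    for char in text:
        if char in DIGITS:
            digit = DIGITS[char]
        elif char in SMALL_UNITS:
            # A bare 十 (as in 十七) stands for an implicit 一十.
            section += (digit or 1) * SMALL_UNITS[char]
            digit = 0
        elif char in LARGE_UNITS:
            section += digit
            if char == "亿":
                # 亿 scales everything read so far (e.g. 一万亿).
                total = (total + section) * LARGE_UNITS[char]
            else:
                total += section * LARGE_UNITS[char]
            section = 0
            digit = 0
        else:
            raise ValueError(f"not a supported numeral character: {char!r}")
    return total + section + digit


# Numeral parts of the slide's examples (units such as 公斤 or 美元 removed).
for numeral in ("二百五十六", "五千万", "一百四十四亿七千万", "十七"):
    print(numeral, chinese_numeral_to_int(numeral))
```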

Page 11: HANA SPS07 Text Analysis

Additional Predefined Core Extractions for Russian

TITLE President
PERSON Barack Obama
PEOPLE Greeks
LANGUAGE Greek

ADDRESS1 245 First Street Floor 16
ADDRESS2 Cambridge, MA 02142
LOCALITY Cambridge
REGION@MINOR Napa County
REGION@MAJOR Connecticut
COUNTRY Brazil
CONTINENT South America
GEO_FEATURE Mount Fuji
GEO_AREA Scandinavia

ORGANIZATION@COMMERCIAL AT&T
ORGANIZATION@EDUCATIONAL University of Washington
ORGANIZATION@OTHER FBI
PRODUCT iPhone
TICKER NYSE:SAP

SOCIAL_MEDIA@TWITTER_ID @SAP
SOCIAL_MEDIA@TWITTER_TOPIC #HANA

DATE 2/14/2011
DAY Monday
MONTH June
YEAR 2011
TIME 3:47pm
TIME_PERIOD 3 days, from 9 to 5pm
HOLIDAY Memorial Day

CURRENCY 17 euros

MEASURE 217 meters
PERCENT 4%

PHONE [email protected]@sap.com
URI@IP 165.14.2.0
URI@URL http://sap.com

Syntactic Entities:
NOUN_GROUP big umbrella
PROP_MISC Cup o’ Soup

Page 12: HANA SPS07 Text Analysis

Improved Voice of Customer Extraction for Simplified Chinese

The following major fact types are classified:

– Sentiments: expression of a customer’s feelings about something
– Problems: a statement about something which impedes a customer’s work
– Requests: expression of a customer’s desire for an enhancement/change
– Profanity: defines a set of pejorative vocabulary
– Emoticons: expression of someone's feelings about the whole sentence or situation

Improvements:
– Focuses on finer extraction of online reviews and implementing customer feedback
– Dramatic overall improvement in stances and topics
– Recall and precision testing results jumped significantly higher

Page 13: HANA SPS07 Text Analysis

Disclaimer

This presentation outlines our general product direction and should not be relied on in making a purchase decision. This presentation is not subject to your license agreement or any other agreement with SAP.

SAP has no obligation to pursue any course of business outlined in this presentation or to develop or release any functionality mentioned in this presentation. This presentation and SAP’s strategy and possible future developments are subject to change and may be changed by SAP at any time for any reason without notice.

This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.

Page 14: HANA SPS07 Text Analysis

Thank you

Contact information

Anthony Waite
SAP HANA Product [email protected]

To get the best overview of what’s new in SAP HANA SPS 07, read this blog.

Page 15: HANA SPS07 Text Analysis

© 2013 SAP AG. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice.

Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.

National product specifications may vary.

These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries. Please see http://www.sap.com/corporate-en/legal/copyright/index.epx#trademark for additional trademark information and notices.

Page 16: HANA SPS07 Text Analysis

© 2013 SAP AG. All rights reserved.

Distribution and reproduction of this publication or parts of it, for whatever purpose and in whatever form, are not permitted without the express written permission of SAP AG. Information contained in this publication may be changed without prior notice.

Some of the software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.

Products may have country-specific differences.

These materials are provided by SAP AG and its affiliated companies ("SAP Group") and are for informational purposes only. SAP Group assumes no liability or warranty of any kind for errors or omissions in this publication. SAP Group is responsible for products and services only as expressly set out in the agreement governing the respective products and services. None of the information contained herein is to be interpreted as an additional warranty.

SAP and other SAP products and services mentioned in this document, as well as their respective logos, are trademarks or registered trademarks of SAP AG in Germany and various other countries worldwide. Further notices and information on trademark law can be found at http://www.sap.com/corporate-en/legal/copyright/index.epx#trademark.