

  • Text-To-Query

    Dynamically building structured analytics to illustrate textual content

    BEWEB 2010, Lausanne, Switzerland

    2010 March 22nd

    Raphaël Thollot

    SAP BusinessObjects ARC

    Ecole Centrale Paris

    Falk Brauer

    SAP Research CEC Dresden

    Wojciech Barczynski

    SAP Research CEC Dresden

    Marie-Aude Aufaure

    Ecole Centrale Paris

    SAP BusinessObjects Chair

    [email protected] [email protected] [email protected] [email protected]

  • © 2010 SAP AG. All rights reserved. / Page 2

    Introduction – Motivation and objectives

    Background

    A majority of information lies in unstructured form

    Data management often requires a first structuring phase

    Most users are not database or BI experts

    Navigating between structured and unstructured data is still difficult

    Prior work did not use the metadata of a warehouse to suggest aggregated analytics

    Use cases

    Augmented web browsing

    Supported data acquisition

    Goal: suggest relevant structured data with an appropriate visualization, based on the user's unstructured input

  • Introduction – State-of-the-art

    EROCS [4, 10]: Entity RecOgnition in Context of Structured data

    Link an unstructured document to external structured data

    Text-entity associations enable consolidated BI

    Focus on the EROCS system

    EROCS links a document to additional information on extracted entities, but does not consider aggregated analytics

  • Our approach

    Enable named entity recognition based on structured data

    Combine entities extracted from a text to build relevant queries

    Propose well-adapted visualizations for query results

    Challenges

    Building and maintaining entity dictionaries is costly

    Assembling entities into meaningful queries

    Explicit and implicit mentions

    Influence the visualization to better illustrate analysis intentions in the text

    [Diagram labels: Text-To-Query; Visualization; Structured data; Text analysis; Solution overview]

  • Solution overview – Workflow

    The solution we propose works in two phases

    Produce the necessary metadata from an existing universe to enhance text analysis

    Extract the context of a piece of text to understand intentions and expectations

    Generate a query to produce appropriate charts with relevant data

    Pre-processing phase: SL, ThingFinder

    Runtime phase: ThingFinder, SL, CVOM

    SL: Semantic Layer

    ThingFinder: entity extraction technology

    CVOM: visualization framework

  • Solution overview – Technical components

    Data warehouses

    OLAP cubes

    Measures, dimensions

    Hierarchies

    Analysis operations

    Drill-down

    Filter

    Semantic Layer

    Meaningful naming of an underlying SQL structure

    Enables queries from non-expert users

    « Revenue » « Country » « 2010 »

    [Figure: OLAP cube with axes Product category and Country; aggregated measure: Sales revenue]
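The analysis operations listed on this slide can be illustrated with a minimal sketch. It assumes a tiny in-memory fact table; the rows, the column names, and the `aggregate` helper are hypothetical and do not come from the presentation or from any SAP API.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, product_category, year, sales_revenue)
COLUMNS = ("country", "product_category", "year", "sales_revenue")
FACTS = [
    ("France", "Accommodation", 2009, 120.0),
    ("France", "Food", 2009, 30.0),
    ("France", "Accommodation", 2010, 150.0),
    ("US", "Accommodation", 2010, 200.0),
    ("US", "Food", 2010, 50.0),
]


def aggregate(facts, dimensions, measure="sales_revenue", filters=None):
    """Sum `measure` grouped by `dimensions`, keeping rows that match `filters`."""
    filters = filters or {}
    totals = defaultdict(float)
    for row in facts:
        record = dict(zip(COLUMNS, row))
        # Filter: skip rows whose dimension value differs from the requested one.
        if any(record[dim] != value for dim, value in filters.items()):
            continue
        key = tuple(record[dim] for dim in dimensions)
        totals[key] += record[measure]
    return dict(totals)


# "Revenue" per "Country", filtered on "2010"
by_country = aggregate(FACTS, ["country"], filters={"year": 2010})

# Drill-down: same measure, one more dimension
drill_down = aggregate(FACTS, ["country", "product_category"], filters={"year": 2010})
```

A semantic layer would expose the same operations through business names such as « Revenue » and « Country » instead of raw column names.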

  • Solution overview – Pre-processing phase

    Automatic generation of an entity dictionary

    Category >> Entity >> Variant

    Entities described in a warehouse

    Measures

    Dimensions and instances

    An entity may appear in different forms

    Stemming

    Variants dictionary

    Typing warehouse objects

    Measures and dimensions can belong to standard analysis categories

    E.g., the dimension Country is in the standard category Geography
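The pre-processing step can be sketched as follows, assuming the universe metadata is available as a plain dictionary. The `UNIVERSE` and `TYPING` structures, the `build_dictionary` function, and the crude `naive_stem` suffix stripper (a stand-in for a real stemmer) are illustrative assumptions, not the presenters' implementation.

```python
# Hypothetical universe metadata: measures, and dimensions with their instances.
UNIVERSE = {
    "measures": ["Sales revenue"],
    "dimensions": {
        "Country": ["France", "US"],
        "Resort": ["French Riviera", "Bahamas Beach"],
    },
}

# Hypothetical typing metadata: warehouse object -> standard analysis category.
TYPING = {"Sales revenue": "Sales", "Country": "Geography"}


def naive_stem(word):
    """Very rough plural stripping; a stand-in for a real stemmer."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def build_dictionary(universe, typing):
    """Return {category: {entity: {"variants": [...], "standard_category": ...}}}."""
    dictionary = {}

    def add(category, entity):
        stemmed = " ".join(naive_stem(token) for token in entity.split())
        dictionary.setdefault(category, {})[entity] = {
            "variants": sorted({entity.lower(), stemmed}),
            "standard_category": typing.get(entity),
        }

    for measure in universe["measures"]:
        add("Measure", measure)
    for dimension, instances in universe["dimensions"].items():
        add("Dimension", dimension)
        for instance in instances:
            # Instances are filed under the dimension they belong to.
            add(dimension, instance)
    return dictionary


entity_dictionary = build_dictionary(UNIVERSE, TYPING)
```

The nesting mirrors the Category >> Entity >> Variant layout named on the slide; a production dictionary would also need manually curated variants, which this sketch does not attempt.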

  • Solution overview – Architecture

    [Architecture diagram]

    Tiers: client tier, web tier, server tier

    Front-ends: Outlook add-in, PowerPoint add-in, browser extension

    Services: runtime web service, runtime back-end server, pre-processor

    Metadata: universe-specific dictionary, standard analysis categories dictionary, functional dependencies, typing metadata

    Platforms and SDKs: Business Intelligence platform (BI SDK), text analysis platform (Text analysis SDK)

  • Runtime analysis – Illustration of categories in the generated dictionary

    Annotations on the example sentence: talking about sales; interested in an evolution analysis; interested in resorts

    A category is described by keywords referring to the concept

    Standard subjects

    - Sales

    - Finance

    - Etc.

    Standard analysis dimensions

    - Time

    - Geography

    - Etc.

    Domain-specific dimension

    Vocabulary (partly) defined in the Semantic Layer (universe).

    The pre-processing phase extends the standard dictionary with custom entities defined in a data warehouse

    Standard analysis categories (SAC)

    Business entities (BE)

    « Our sales are growing in all resorts »
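Keyword-based category recognition on the example sentence might look like the sketch below. The keyword sets and the `extract_categories` function are assumptions made for illustration; the slide only states that a category is described by keywords referring to the concept.

```python
import re

# Hypothetical keyword lists for standard analysis categories (SAC).
SAC_KEYWORDS = {
    "Sales": {"sales", "revenue", "sold"},
    "Time": {"growing", "grows", "increases", "evolution", "trend"},
    "Geography": {"country", "countries", "everywhere"},
}

# Hypothetical business entities (BE) coming from the universe.
BUSINESS_ENTITIES = {"Resort": {"resort", "resorts"}}


def extract_categories(sentence):
    """Return the SAC and BE categories whose keywords occur in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    sac = sorted(cat for cat, words in SAC_KEYWORDS.items() if tokens & words)
    be = sorted(cat for cat, words in BUSINESS_ENTITIES.items() if tokens & words)
    return {"SAC": sac, "BE": be}


result = extract_categories("Our sales are growing in all resorts")
```

Listing inflected forms explicitly stands in for the stemming and variants dictionary described in the pre-processing phase.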

  • Runtime analysis – Capturing the Data Analysis Context

    Data Analysis Context (DAC)

    Set of extracted entities (SAC + BE)

    Sentence-by-sentence analysis

    Maintain units of sense

    Continuous DAC update: using successive sentences to propagate key concepts

    The warehouse is not accessed when assembling queries

    Repeated for each text unit:

    1. Segment text

    2. Stem text unit

    3. Extract entities (SAC and BE)

    4. Augment with previous DAC

    5. Group into queries

    6. Build adapted charts
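The sentence-by-sentence DAC update can be sketched under one explicit assumption: a sentence that mentions no measure inherits the measure of the previous DAC. That propagation rule, the `DICTIONARY` contents, and the function names are hypothetical; stemming is skipped for brevity.

```python
import re

# Hypothetical dictionary: variant -> (kind, entity)
DICTIONARY = {
    "sales": ("measure", "Revenue"),
    "revenue": ("measure", "Revenue"),
    "growing": ("sac", "Time"),
    "everywhere": ("sac", "Geography"),
    "resort": ("dimension", "Resort"),
    "french riviera": ("instance", "Resort=French Riviera"),
}


def segment(text):
    """Split a text into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract(sentence):
    """Return {kind: set of entities} for dictionary variants found in the sentence."""
    lowered = sentence.lower()
    found = {}
    for variant, (kind, entity) in DICTIONARY.items():
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.setdefault(kind, set()).add(entity)
    return found


def analyse(text):
    """Build one Data Analysis Context per sentence, propagating the measure."""
    previous = {}
    contexts = []
    for sentence in segment(text):
        dac = extract(sentence)
        # Assumed propagation rule: reuse the previous measure when none is mentioned.
        if "measure" not in dac and "measure" in previous:
            dac["measure"] = set(previous["measure"])
        contexts.append(dac)
        previous = dac
    return contexts


TEXT = (
    "Sales are growing everywhere. "
    "The relative importance of each resort to the revenue is satisfying. "
    "French Riviera is doing very good."
)
contexts = analyse(TEXT)
```

Only the dictionary is consulted here, which is consistent with the statement that the warehouse is not accessed when assembling queries.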

  • Runtime analysis – Building query suggestions from a Data Analysis Context

    1. SAC representation

    Ensure all extracted SAC are represented

    – E.g., “our revenue increases in some countries”

    – Revenue + Country + Time SAC

    Choose the highest-level object from the warehouse

    Aggregate at the highest level in undetermined cases

    – Time dimension: Year, Quarter, Month, Week, etc.

    2. Functional dependencies

    Group compatible measures & dimensions

    – Revenue is not compatible with Reservation year

    3. Filters

    Filter on extracted instances of dimensions

    – E.g., “French Riviera generated more revenue than Bahamas Beach”

    – Add the Resort dimension

    Particular case

    – Remove dimensions with a filter on a single instance

    4. Analysis type

    Influence the generated visualization / chart

    Analysis types

    – Trending

    – Contribution

    – Comparison

    – Ranking
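The four rules on this slide can be combined in a rough sketch. The metadata tables (`SAC_OBJECTS`, `COMPATIBLE`), the `build_query` signature, and the choice to drop incompatible dimensions rather than split them into separate queries are all simplifying assumptions.

```python
# Hypothetical metadata: warehouse objects per SAC, ordered from highest to lowest level.
SAC_OBJECTS = {
    "Time": ["Year", "Quarter", "Month", "Week"],
    "Geography": ["Country", "City"],
}

# Hypothetical functional dependencies: measure -> dimensions it can be grouped by.
COMPATIBLE = {
    "Revenue": {"Year", "Quarter", "Month", "Week", "Country", "City", "Resort"},
}


def build_query(measure, sacs=(), dimensions=(), filters=None, analysis="Undetermined"):
    """Assemble one query suggestion from the parts of a Data Analysis Context."""
    filters = filters or {}
    dims = list(dimensions)

    # 1. SAC representation: every extracted SAC gets its highest-level object
    #    unless one of its objects was mentioned explicitly.
    for sac in sacs:
        candidates = SAC_OBJECTS[sac]
        if not any(dim in candidates for dim in dims):
            dims.append(candidates[0])

    # 3. Filters: a dimension filtered on extracted instances joins the query.
    for dim in filters:
        if dim not in dims:
            dims.append(dim)

    # 2. Functional dependencies: keep only dimensions compatible with the measure
    #    (simplification: incompatible ones are dropped, not regrouped).
    dims = [dim for dim in dims if dim in COMPATIBLE.get(measure, set())]

    # Particular case: drop a dimension filtered on a single instance.
    dims = [dim for dim in dims if len(filters.get(dim, [])) != 1]

    return {"measure": measure, "dimensions": dims, "filters": filters, "analysis": analysis}


trend = build_query("Revenue", sacs=["Time", "Geography"], analysis="Trending")
single = build_query("Revenue", filters={"Resort": ["French Riviera"]})
compare = build_query("Revenue", filters={"Resort": ["French Riviera", "Bahamas Beach"]})
```

With these assumptions, `trend` corresponds to Revenue per Year per Country, and `single` to a total Revenue filtered on one resort with no grouping dimension.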

  • Preliminary results – Sample query suggestions

    Text: “Sales are growing everywhere.”

    Query: Revenue per Year per Country

    Analysis: Trending

    Text: “The relative importance of each resort to the revenue is satisfying.”

    Query: Revenue per Resort

    Analysis: Contribution

    Text: “French Riviera is doing very good.”

    Query: Total Revenue, filter on Resort = French Riviera

    Analysis: Undetermined

    [Chart preview: one chart per query]
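How an analysis type could drive the choice of chart is sketched below. The keyword lists and, in particular, the mapping from analysis type to chart type are assumptions; the presentation names the analysis types but does not specify which chart each one produces.

```python
import re

# Hypothetical trigger words for each analysis type.
ANALYSIS_KEYWORDS = {
    "Trending": {"growing", "increases", "evolution", "trend"},
    "Contribution": {"importance", "share", "contribution"},
    "Comparison": {"than", "versus", "compared"},
    "Ranking": {"best", "top", "worst"},
}

# Assumed chart per analysis type; "table" is an arbitrary fallback.
CHART_FOR_ANALYSIS = {
    "Trending": "line chart",
    "Contribution": "pie chart",
    "Comparison": "bar chart",
    "Ranking": "sorted bar chart",
    "Undetermined": "table",
}


def detect_analysis(sentence):
    """Return the first analysis type whose keywords appear in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    for analysis, keywords in ANALYSIS_KEYWORDS.items():
        if tokens & keywords:
            return analysis
    return "Undetermined"


def suggest_chart(sentence):
    """Return (analysis type, chart type) for a sentence."""
    analysis = detect_analysis(sentence)
    return analysis, CHART_FOR_ANALYSIS[analysis]


suggestions = [
    suggest_chart("Sales are growing everywhere."),
    suggest_chart("The relative importance of each resort to the revenue is satisfying."),
    suggest_chart("French Riviera is doing very good."),
]
```

The three sample sentences yield the analysis types shown on this slide (Trending, Contribution, Undetermined) under these keyword choices.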

  • Preliminary results – Sample query suggestions

  • Conclusion

    Achievements

    We leverage metadata from a warehouse

    Build dictionary for entity recognition

    We illustrate a text with corporate data

    Dynamically generated and meaningful queries

    Appropriate visualization

    Text analysis is kept simple

    The method is easy to apply to another language

    Restricted business domain

    We developed two prototype front-ends

    Office and web environments

    PowerPoint add-in

    12Sprints method

  • Identified key issues and future work

    Key issues

    Coverage and extensibility

    Automatically generate variants for custom entities

    Increase the coverage of the standard dictionary

    Suggestions evaluation

    Estimate relevance

    Personalize suggestions

    Ongoing and future work

    Evaluation method for the relevance of suggestions

    Refine and extend standard analysis categories

    Handle query suggestions across multiple business contexts / warehouses

  • Thank you!

    Raphaël Thollot

    SAP BusinessObjects

    Academic Research Center (ARC)

    Levallois-Perret, France

    T +33 1 41 25 30 40

    [email protected]

    www.sap.com
