Towards Keyword-Driven Analytical Processing

Ping Wu, Yannis Sismanis, Berthold Reinwald

Presented by: Amit Goyal ([email protected])
Date: 21st Nov 2007

Outline

• Background
• Motivation
• Solution Framework
  • Differentiate
  • Explore
• Experiments
• Conclusions

Motivation

• Problem:
  • Given: multiple data sources
  • Find: patterns (such as?)
• What were the sales volumes by region and product category for the last year?
• Which orders should we fill to maximize revenues?
• Will a 10% discount increase sales volume sufficiently?

Motivation

Users don’t know how to specify what they want, but they know it when they see it

Keyword queries can have different semantics depending on the user's intent

Background

• Decision Support
  • Decision support systems are a class of computer-based information systems, including knowledge-based systems, that support decision-making activities
• Data Warehouse
  • Main repository of an organization's historical data for analysis purposes
  • It contains the raw material for management's decision support system
• OLAP (online analytical processing)
  • Interactive process of creating, managing, analyzing and reporting of data

Facts and Dimensions

• Fact: measures the performance of a business
  • Example facts: sales, budget, profit, inventory
  • Example fact table: Transactions (timekey, storekey, ckey, units, price)
• Dimension: specifies a fact
  • Example dimensions: product, customer data, sales person, store
  • Example dimension table: Customer (ckey, firstname, lastname, address, dateOfBirth, occupation, …)
• Database is a set of facts (points) in a multidimensional space

Example

Star schema diagram (tables and their columns):

• Fact Table (OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price)
• Order (Order No, Order Date)
• Customer (Customer No, Customer Name, Customer Address, City)
• Salesperson (SalespersonID, SalespersonName, City, Quota)
• Product (ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice)
• Date (DateKey, Date)
• City (CityName, State, Country)

Notations

• Dataspace DS: the entire multi-dimensional dataspace in an OLAP database
• Subspace DS’: a subset of the dataspace
• Star join: the set of joins performed when a fact table is joined to two or more dimension tables
• Different interpretations of the keywords correspond to different star join expressions
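A minimal sketch of a star join that selects a subspace DS’, assuming a toy schema. The table contents, the column names (`storekey`, `prodkey`, `city`, `group`) and the helper names are hypothetical and are not taken from the paper.

```python
# Hypothetical toy star schema: one fact table and two dimension tables.
fact = [
    {"storekey": 1, "prodkey": 10, "units": 3},
    {"storekey": 1, "prodkey": 11, "units": 5},
    {"storekey": 2, "prodkey": 10, "units": 2},
]
store = {1: {"city": "Columbus"}, 2: {"city": "San Jose"}}
product = {10: {"group": "LCD Projectors"}, 11: {"group": "LCD TVs"}}


def star_join(fact_rows, dims):
    """Join each fact row to every dimension table through its foreign key."""
    for row in fact_rows:
        joined = dict(row)
        for key_col, table in dims.items():
            joined.update(table[row[key_col]])
        yield joined


def subspace(fact_rows, dims, **predicates):
    """Keep the joined fact points whose dimension attributes match all predicates."""
    return [
        r for r in star_join(fact_rows, dims)
        if all(r[attr] == val for attr, val in predicates.items())
    ]


dims = {"storekey": store, "prodkey": product}
# One interpretation of "Columbus LCD": city = Columbus, product group = LCD Projectors.
ds_prime = subspace(fact, dims, city="Columbus", group="LCD Projectors")
total_units = sum(r["units"] for r in ds_prime)
```

A different interpretation of the same keywords would simply pass different predicates, which is why each interpretation maps to its own star join expression.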

Solution Framework

• Phase: Differentiate
  • Generation of candidate subspaces using user keywords
• Phase: Explore
  • The system first calculates, for the subspace, the aggregated values for some predefined measures
  • It then dynamically finds, for each dimension, the top-k interesting group-by attributes to partition the sub-dataspace

Interestingness

• Application specific
• Focus in the paper:
  • Surprising aggregates
  • Correlated aggregates

Differentiate Phase

• First, generate candidate subspaces
• Then, organize them effectively by ranking them
• Then, ask the user to select one of the candidate subspaces
• Proceed to the Explore phase

Ambiguity of Keyword Queries

• In large and complex OLAP dataspaces, a keyword almost always matches different attribute domains in different dimensions
• This creates a large number of possible query interpretations
• Example: consider the query “Columbus LCD”
  • Columbus: holiday or city?
  • LCD: projectors, TV or monitor?

Ambiguity of Keyword Queries: WHY?

• Correctness: different interpretations of the keywords may result in completely different subspaces. Thus, picking the correct interpretation keeps errors from propagating to subsequent phases in the system
• Performance: computing all possible interpretations may be expensive
• Users know exactly the semantic meaning of their keywords, so put them in the query processing loop

Candidate Interpretation Generation

• Problem statement: given a keyword query q = {k_1, k_2, ..., k_n}, generate candidate interpretations CI = {C_1, C_2, ..., C_m} of q.
• For each keyword k_i, the CI generator first probes the full-text index to obtain the hit set H_i = {h_i^1, h_i^2, ..., h_i^m}
• Each hit h_i^j represents a triplet of relation name, attribute name and attribute instance value: {h_i^j.R, h_i^j.Attr, h_i^j.Val}.
• Within a hit set, hits can be organized into hit groups if they have the same relation name and attribute name.

Hit Group : Example

• Consider the query “Columbus LCD”
• The hit set for keyword “Columbus” has 3 hits: Loc/City/(“Columbus”), Holiday/Event/(“Columbus Day”), Holiday/Event/(“Columbus Week”).
• Thus, one hit group for the keyword “Columbus” is {Holiday/Event/(“Columbus Day” OR “Columbus Week”)}; the remaining hit, Loc/City/(“Columbus”), forms a hit group of its own.
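The grouping step can be sketched as follows. This is an illustration only: the in-memory `INDEX` list is a hypothetical stand-in for the full-text index, and the whole-word matching rule in `hit_set` is an assumption, not the paper's probing method.

```python
from collections import defaultdict

# Hypothetical stand-in for the full-text index: (relation, attribute, value) triplets.
INDEX = [
    ("Loc", "City", "Columbus"),
    ("Holiday", "Event", "Columbus Day"),
    ("Holiday", "Event", "Columbus Week"),
    ("PGROUP", "Group Name", "LCD Projectors"),
]


def hit_set(keyword):
    """Return every triplet whose value contains the keyword as a whole word."""
    kw = keyword.lower()
    return [hit for hit in INDEX if kw in hit[2].lower().split()]


def hit_groups(hits):
    """Group hits that share the same relation name and attribute name."""
    groups = defaultdict(list)
    for relation, attribute, value in hits:
        groups[(relation, attribute)].append(value)
    return dict(groups)


columbus_groups = hit_groups(hit_set("Columbus"))
```

Each key of `columbus_groups` is one attribute domain, and its list holds the values that would be OR-ed together in that hit group.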

Star Seed and Star Net

• Star Seed (SS): for query q, SS is defined as a set of n hit groups, each of which is drawn from a different hit set
  • E.g. for the query “Columbus LCD”, one candidate SS could be {Holiday/Event/(“Columbus Day” OR “Columbus Week”), PGROUP/Group Name/(“LCD Projectors”)}
• Star Net (SN): for an SS, SN is defined as a join that connects all the hit groups from the SS
• Note: a single SS could correspond to multiple SNs
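Because a star seed takes exactly one hit group from each keyword's hit set, the candidate seeds are the Cartesian product of the per-keyword hit groups. A small sketch, assuming hit groups are identified by their (relation, attribute) pair; the `("Product", "ProdName")` group is a made-up example and does not come from the slides.

```python
from itertools import product

# Hit groups per keyword, identified by (relation, attribute). Values are illustrative.
hit_groups_per_keyword = {
    "Columbus": [("Loc", "City"), ("Holiday", "Event")],
    "LCD": [("PGROUP", "Group Name"), ("Product", "ProdName")],
}


def star_seeds(groups_by_keyword):
    """Enumerate star seeds: one hit group drawn from each keyword's hit set."""
    keywords = list(groups_by_keyword)
    choices = (groups_by_keyword[k] for k in keywords)
    return [dict(zip(keywords, combo)) for combo in product(*choices)]


seeds = star_seeds(hit_groups_per_keyword)
```

The number of seeds is the product of the hit-group counts, which is why the candidates need ranking. Turning a seed into one or more star nets additionally requires the join paths of the schema, which this sketch leaves out.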

Candidate Star Net Generation

Ranking the Candidate Star Nets

• The number of interpretations may be large, so ranking is necessary
• SCORE(SN, q) =
• Sim: string-matching similarity function between the query q and the attribute value of hit h_i^j
• |HG_k| is the number of hits in the hit group
• The sum of Sim over the hits in a hit group, divided by |HG_k|, is the average hit similarity value
• The average hit similarity value is further normalized to penalize hit groups with many matched attribute instances
  • “California”: a state, or a large number of distinct addresses on “California Street”
• Finally, the sum of the hit group scores is divided by |SN|^2 to prioritize smaller Star Nets
  • Score(Star Nets with “San Jose{city}”) > Score(Star Nets with “San Antonio{city}” and “Jose{Customer First Name}”)
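A rough sketch of how the described terms could fit together. Two parts are assumptions and not the paper's definitions: `sim` uses `difflib.SequenceMatcher` as a stand-in similarity, and the group-size penalty `1 + ln|HG_k|` is a hypothetical choice, since the slide only says that large hit groups are penalized.

```python
import math
from difflib import SequenceMatcher


def sim(query, value):
    """Stand-in string similarity in [0, 1]; not the paper's Sim function."""
    return SequenceMatcher(None, query.lower(), value.lower()).ratio()


def hit_group_score(query, values):
    """Average hit similarity, damped for groups with many matched values."""
    avg = sum(sim(query, v) for v in values) / len(values)
    # ASSUMPTION: logarithmic penalty; the slide does not give the exact form.
    return avg / (1 + math.log(len(values)))


def star_net_score(query, hit_groups):
    """Sum of hit group scores divided by |SN|^2, favouring smaller star nets."""
    total = sum(hit_group_score(query, values) for values in hit_groups)
    return total / len(hit_groups) ** 2


one_group = star_net_score("San Jose", [["San Jose"]])
two_groups = star_net_score("San Jose", [["San Antonio"], ["Jose"]])
```

With this form, a star net made of a single exact-match hit group outranks one that needs two hit groups to cover the same keywords, which matches the “San Jose” example.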

Handling of Phrase Queries

• Consider the query “San Jose”
• Output hit set: “San Jose”, “San Antonio”, etc.
• The score function does not take into consideration the fact that “San Jose” perfectly matches two keywords and thus should be ranked higher.
• Solution: merge the two hit groups from two hit sets if
  • both groups are from the same attribute domain, and
  • the intersection between the two groups is not empty
• Can be generalized to phrases containing more than 2 keywords
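The merge condition can be sketched as below. The slide does not say what the merged group contains; keeping only the shared values is an assumption made here, and the sample hit groups are illustrative.

```python
def merge_phrase_groups(groups_a, groups_b):
    """Merge hit groups of two adjacent keywords.

    Each argument maps (relation, attribute) -> set of matched values.
    Two groups are merged when they share the attribute domain and have
    at least one value in common.
    """
    merged = {}
    for domain, values_a in groups_a.items():
        # ASSUMPTION: the merged group keeps only the values matched by both keywords.
        common = values_a & groups_b.get(domain, set())
        if common:
            merged[domain] = common
    return merged


san_groups = {("Loc", "City"): {"San Jose", "San Antonio"}}
jose_groups = {
    ("Loc", "City"): {"San Jose"},
    ("Customer", "FirstName"): {"Jose"},
}
phrase_groups = merge_phrase_groups(san_groups, jose_groups)
```

For longer phrases the same function could be folded over the keywords from left to right.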

Explore Phase

• By now, a unique sub-dataspace DS’ has been identified by the user
• The system computes the group-by aggregates over the measure from all qualified fact points in DS’
• Rank the group-by attributes
  • Categorical attributes
  • Numerical attributes
• Organize attribute instances
  • Categorical attributes
  • Numerical attributes

Automatic Facet Construction For Sub-Dataspaces

• After a unique sub-dataspace DS’ has been identified by the user, the system dynamically constructs a multi-faceted search (MFS) interface for the user to explore detailed-level aggregation in DS’.
• In real-world databases, the number of dimensions may be large, and each dimension may have many attributes.
• So the group-by attributes need to be ranked dynamically, based on the interestingness of the resulting partitions.

Ranking Group-by Attributes

Rank the group-by attributes based on interestingness of their resulting partitions

Roll-up Partitioning (RUP)

Roll-up Partitioning

• By looking at the sub-dataspace alone, it is impossible to define the interestingness of a certain partition in a robust way
• Dimensions are hierarchical, so let's use that!
• Roll up along some dimensions and compare the two partitions.
• The more similar the two partitions are, the less surprising the candidate group-by attribute is considered to be

Example

• Determine whether the attribute zipcode of dimension Store is an interesting group-by attribute for the subspace associated with Product “Television”
• Roll up to a bigger space along the Product dimension, to Home Entertainment Electronics
• If the distribution deviates -> surprising
• If correlated -> bellwethers

Roll-up Partitioning

SCORE(attr_i^j, DS’) = - E[(X - µ_X)(Y - µ_Y)] / (σ_X σ_Y)

Where,

X = aggregation values on partition PAR(DS’, attr_i^j)

Y = aggregation values on partition PAR(RUP(DS’), attr_i^j)

E = expected value

µ = mean

σ = standard deviation
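The score above is the negated Pearson correlation between the two partitions. A minimal sketch, assuming `x` and `y` are lists of aggregates aligned by attribute value (entry i of both lists belongs to the same group-by value); the function name `rup_score` is hypothetical.

```python
import math


def rup_score(x, y):
    """Negated Pearson correlation of two aligned lists of aggregates.

    x: aggregates over the partition of the sub-dataspace DS'
    y: aggregates over the same partition of the rolled-up space RUP(DS')
    Raises ZeroDivisionError if either list is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return -cov / (std_x * std_y)


same_shape = rup_score([1, 2, 3], [10, 20, 30])      # partitions move together
opposite_shape = rup_score([1, 2, 3], [30, 20, 10])  # partitions move oppositely
```

A score near -1 means the subspace behaves like its rolled-up parent, so the attribute is not surprising; higher scores indicate that the distributions deviate.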

Ranking both Categorical and Numerical Attributes

• Categorical: easy. The previous score function can be directly applied.
• Numerical: the correlation depends on how the numerical domain is bucketized
  • First split the domains of the candidate attributes into “sufficiently” many buckets, or basic intervals
  • Tuples in the same bucket are aggregated together to produce new attribute values
  • The intuition behind splitting is that the correlation value of two distributions can be preserved as the number of buckets becomes large and the interval range becomes small
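A small sketch of the bucketizing step. Equal-width buckets and SUM as the aggregate are assumptions made for illustration; the slide does not state how the basic intervals are chosen.

```python
def bucketize(rows, num_buckets):
    """Aggregate a measure into equal-width buckets of a numerical attribute.

    rows: list of (attribute_value, measure) pairs
    Returns one summed measure per bucket, ordered by attribute range.
    """
    values = [value for value, _ in rows]
    low, high = min(values), max(values)
    # Guard against a zero width when all attribute values are identical.
    width = (high - low) / num_buckets or 1.0
    sums = [0.0] * num_buckets
    for value, measure in rows:
        # The maximum value is clamped into the last bucket.
        index = min(int((value - low) / width), num_buckets - 1)
        sums[index] += measure
    return sums


rows = [(0, 1), (5, 2), (10, 3), (20, 4)]
coarse = bucketize(rows, 2)
fine = bucketize(rows, 4)
```

The per-bucket sums computed on DS’ and on its roll-up would then serve as the two aligned distributions whose correlation is compared.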

Organizing Attribute Instances

• So far, we have ranked attributes
• How should the values within each attribute domain be organized?
• Categorical attribute: SCORE(attr_i^j.cat_p, DS’) =
• G is the aggregate function

Organizing Numerical Attribute Instances

• Given m basic intervals, merge adjacent intervals into k numeric categories
• 3 objectives:
  • The number of resulting intervals should not be large (suitable for navigation)
  • The merged intervals should not be skewed: the number of intervals in the largest range should not exceed L times the number of intervals in the smallest range
  • The merged partition should preserve the original interestingness value, i.e. the correlation value from the basic intervals

Organizing Numerical Attribute Instances

Critique:

• What is a neighbor?
• How to generate it?
• SCORE function is not defined

Experiments

• AdventureWorks data warehouse
• Divided into two separate databases:
  • AW_ONLINE: 5 dimensions, 3 hierarchical, 10 tables
  • AW_RESELLER: 7 dimensions, 4 hierarchical, 13 tables

Qualitative Sample Results

• Keyword query: “California Mountain Bike”
• Phase 1: the system returns a list of star nets
• The analyst selects the first star net

Evaluation of Subspace Ranking Algorithm

• 50 manually written keyword queries
• X-axis: rank of the results
• Y-axis: percentage of the queries satisfied
• 4 ranking methods
• Relevance is checked manually
• Note that group size normalization is not significant

Effects of Bucket Number in Group-By Attribute Ranking

• AW_Online database
• Numerical attributes:
  • Yearly-Income from Customer table
  • Dealer-Price from Product table
• Roll-up operations:
  • StateProvinces to Countries
  • Subcategory to Product Category
• Test the assumption that with a “sufficiently” large number of basic intervals, the actual correlation value can be captured

Effects of Bucket Number in Group-By Attribute Ranking

• AW_Reseller database
• Numerical attributes:
  • AnnualSales, AnnualRevenue from Reseller table
  • NumberOfEmployees
• Roll-up operations:
  • Product Subcategories to Categories
• Error percentage is computed as the deviation from the ideal case (where each distinct value has its own bucket)

Study of Numerical Partitioning Methods

Contribution/Conclusion

Integrate keyword search with the efficient aggregation power of OLAP

Provides an efficient and easy-to-use solution for business analysts

Ambiguity problem has not been addressed by previous research

Current research on keyword search over RDBMS uses indexes on a tuple level instead of an attribute level

Critique

• Poorly written paper
• Typos:
  • In Section 6.3, para 1, it should be “Section 4.3” instead of “Section 3.4”
  • In Section 6.5, para 1, it should be “Figure 7(a)” instead of “Figure 8(a)”
• In Eqn. 1, it is not clear what |HG_k| and |SN| are
• In Eqn. 2, G is said to be an aggregate function, but it is not specified
• In Algorithm 2, the “SCORE” function used is not defined
• In Algorithm 2, the notion of “neighbor” is not defined
• First, they say the score function (Section 4.3) does not take into account that “San Jose” matches two keywords and therefore should be assigned a much higher score than “San Antonio”; then, in the next section, they claim that the normalization factor |SN|^2 takes this problem into consideration.

Questions??
