Towards Keyword-Driven Analytical Processing
Ping Wu, Yannis Sismanis, Berthold Reinwald
Presented By – Amit Goyal ([email protected])Date: 21st Nov 2007
Outline
Background Motivation Solution Framework
Differentiate Explore
Experiments Conclusions
Motivation
Problem: Given: multiple data sources Find: patterns (such as?)
What were the sales volumes by region and product category for the last year?
Which orders should we fill to maximize revenues? Will a 10% discount increase sales volume
sufficiently?
Motivation
Users don’t know how to specify what they want, but they know it when they see it
Keyword queries can have different semantics based on the intent of user
Background
Decision Support Decision support systems are a class of computer-based
information systems including knowledge based systems that support decision making activities
Data Warehouse Main repository of an organization's historical data for
analysis purposes It contains the raw material for management's decision
support system OLAP (online analytical processing)
Interactive process of creating, managing, analyzing and reporting of data
Facts and Dimensions
Fact: Measures performance of
a business Example facts:
Sales, budget, profit, inventory
Example Fact Table: Transactions (timekey,
storekey, ckey, units, price)
Dimension: Specifies a fact
Example dimensions: Product, customer data,
sales person, store Example dimension
table: Customer (ckey,
firstname, lastname, address, dateOfBirth, occupation, …)
Database is a set of facts (points) in a multidimensional space
Example
Order NoOrder No
Order DateOrder Date
Customer NoCustomer No
Customer NameCustomer Name
Customer Customer AddressAddress
CityCity
SalespersonIDSalespersonID
SalespersonNameSalespersonName
CityCity
QuotaQuota
OrderNOOrderNO
SalespersonIDSalespersonID
CustomerNOCustomerNO
ProdNoProdNo
DateKeyDateKey
CityNameCityName
QuantityQuantity
Total Price
ProductNOProductNO
ProdNameProdName
ProdDescrProdDescr
CategoryCategory
CategoryDescriptionCategoryDescription
UnitPriceUnitPrice
DateKeyDateKey
DateDate
CityNameCityName
StateState
CountryCountry
OrderOrder
CustomerCustomer
SalespersonSalesperson
CityCity
DateDate
ProductProduct
Fact TableFact Table
Notations
Dataspace DS: entire multi-dimensional dataspace in an OLAP database
Subspace DS’: subset of dataspace Star Join: set of joins when a fact table is
joined to two or more dimension tables. Different interpretations of the keywords reflect different star join expressions
Solution Framework Phase Differentiate
Generation of candidate subspaces using user keywords Phase Explore
The system first calculates for the subspace the aggregated values for some predefined measures
Then dynamically finds for each dimension the top-k interesting group-by attributes to partition the sub-dataspace
Interestingness
Application specific Focus in the paper:
Surprising aggregates Correlated aggregates
Differentiate Phase
First generate candidate subspace Then, organize them effectively by ranking
them Then, ask user to select one of candidate
subspace Proceed to Explore Phase
Ambiguity of Keyword Queries In large and complex OLAP dataspaces, a
keyword almost always matches different attribute domains in different dimensions Creates large number of possible query
interpretations Example: Consider query “Columbus LCD”
Columbus – holiday or city? LCD – projectors or TV or monitor?
Ambiguity of Keyword Queries: WHY? Correctness: Different interpretation of keywords
may result in completely different subspaces. Thus, correct interpretation may eliminate error propagation to subsequent phases in the system
Performance: Computation of all possible interpretations may be expensive
Users know exactly the semantic meaning of their keywords. Put them in query processing loop
Candidate Interpretation Generation Problem Stmt: Given a keyword query q={k1,
k2, ...kn}, generate candidate interpretations CI ={C1,C2, ...,Cm} of q.
For each keyword ki, the CI generator first probes the full-text index to obtain the Hit Set Hi. Hi = {hi
1, hi
2, .. , him}
Each hit hij represents a triplet of relation name,
attribute name and the attribute instance value {hij.R,
hij.Attr, hi
j.Val}. Within a hit set, hits can be organized in Hit Groups
if they have same relation name and attribute name.
Hit Group : Example
Consider a query “Columbus LCD” The hit set for keyword “Columbus” has 3
hits: Loc/City/Columbus, Holiday/Event/(“Columbus Day”), Holiday/Event/(“Columbus Week”).
Thus, hit group for the keyword “Columbus” is {Holiday/Event/(“Columbus Day” OR “Columbus Week”)}.
Star Seed and Star Net
Star Seed (SS): For query q, SS is defined as a set of n hit groups, each of which is drawn from a different hit set E.g. For query “Columbus LCD”, one candidate SS could
be {{Holiday/Event/(“Columbus Day” OR “Columbus Week”)}, PGROUP/Group Name/”LCD Projectors”}
Star Net (SN): For SS, SN is defined as a join that connects all the hit groups from the SS
Note: A single SS could correspond to multiple SN
Candidate Star Net Generation
Ranking the Candidate Star Nets Number of all interpretations may be large. Thus, ranking is necessary SCORE(SN,q) =
Sim :- string matching similarity function between the query q and attribute value of hit hi
j
|HGk| is the number of hits in the hit group ∑ over hits in a Hit Group divided by |HGk| is the average hit similarity value Avg hit similarity value is further normalized to penalize hit groups with
many matched attribute instances. “California” – state or a large of distinct address on “California Street”
Finally, summation of hit group scores are divided by |SN|2 to prioritize smaller Star Nets. Score(Star Nets with “San Jose{city}”) > Score(Star Nets with “San Antonio{city}”
and “Jose{Customer First Name}”)
Handling of Phrase Queries
Consider Query: “San Jose” Output hit set: “San Jose”, “San Antonio” etc Score function does not take into consideration the
fact that “San Jose” perfectly matches two keywords, thus should be ranked higher.
Solution: Merge the two hit groups from two hit sets, if Both groups are from same attribute domain The intersection between two groups is not empty
Can be generalized to phrases containing more than 2 keywords
Explore Phase
Till now, a unique sub-dataspace DS’ has been identified by the user
And the system computes the group-by aggregates over the measure from all qualified fact points in DS’
Rank the group-by attributes Categorical Attributes Numerical Attributes
Organizing Attribute Instances Categorical Attributes Numerical Attributes
Automatic Facet Construction For Sub-Dataspaces After a unique sub-dataspace DS’ has been
identified by the user, the system dynamically constructs a multi-faceted search (MFS) interface for the user to explore detailed level aggregation in DS’.
In real world databases, the number of dimensions may be large. And each dimension may have many attributes. So, need to rank group-by attributes dynamically based on
interestingness of the resulting partitions.
Ranking Group-by Attributes
Rank the group-by attributes based on interestingness of their resulting partitions
Roll-up Partitioning (RUP)
Roll-up Partitioning
By looking at the sub-dataspace alone, it is impossible to define interestingness of a certain partition in a robust way
Dimensions are hierarchical, lets use it !! Roll-up along some dimensions, compare the
two partitions. The more similar the two partitions are, the
less the candidate group-by attribute is considered as surprising
Example
Determine whether the attribute zipcode of dimension store is an interesting group-by attribute for the subspace associated with Product Television
Roll-up to bigger space along the Product dimension to Home Entertainment Electronics If distribution deviates -> surprising If correlated -> bellwethers
Roll-up Partitioning
SCORE(attrij, DS’) = - E((X-µx)(Y-µy))/(σx σy)
Where,
X = aggregation values on partition PAR(DS’, attrij)
Y = aggregation values on partition PAR(RUP(DS’), attri
j)
E = expected value
µ = mean
σ = standard deviation
Ranking both Categorical and Numerical Attributes Categorical: Easy. Previous score function can be
directly applied. Numerical: Correlation depends on how the
numerical domain is bucketized First split the domains of the candidate attributes into
“sufficiently” many buckets or basic intervals Tuples in same bucket are aggregated together to produce
new attribute values Intuition behind splitting is that the correlation value
of two distributions can be preserved as the bucket number becomes large and the interval range becomes small
Organizing Attribute Instances Till Now, we have ranked attributes How to organize the values within each
attribute domain? Categorical Attribute: SCORE(attri
j.catp, DS’) =
G is the aggregate function
Organizing Numerical Attribute Instances Given ‘m’ basic intervals, merge adjacent intervals
into ‘k’ numeric categories 3 objectives:
Number of resulting intervals should not be large (suitable for navigation)
Number of merged intervals should not be skewed. Number of intervals in largest range should not exceed L times the number of intervals in smallest range
The merged partition should preserve the original interestingness value, i.e. correlation value from basic intervals
Organizing Numerical Attribute Instances
Critique:
• What is neighbor?• How to generate it?• SCORE function is not defined
Experiments
AdventureWorks data warehouse Divided in two separate databases:
AW_ONLINE – 5 dimensions, 3 hierarchical, 10 tables
AW_RESELLER – 7 dimensions, 4 hierarchical, 13 tables
Qualitative Sample Results
Keyword Query: “California Mountain Bike” Phase1: System returns a list of star nets Analyst selects the first Star Net
Evaluation of Subspace Ranking Algorithm Manually written 50
keywords queries X-axis: Rank of the results Y-axis: %age of the queries
satisfied 4 ranking methods Relevance is checked
manually Note that group size
normalization is not significant
Effects of Bucket Number in Group-By Attribute Ranking
AW_Online database Numerical Attributes:
Yearly-Income from Customer table
Dealer-Price from Product table
Roll-up operations: StateProvinces to
Countries Subcategory to Product
Category
Test the assumption that with a “sufficiently” large number of basic intervals, the actual correlation value can be captured
Effects of Bucket Number in Group-By Attribute Ranking
AW_Reseller Database Numerical Attributes:
AnnualSales, AnnualRevenue from Reseller table
NumberOfEmployees Roll-up operations:
Product Subcategories to Categories
Error percentage is computed by the deviation from the ideal case (where each distinct value has its own bucket)
Study of Numerical Partitioning Methods
Contribution/Conclusion
Integrate keyword search with the efficient aggregation power of OLAP
Provides an efficient and easy-to-use solution for business analyst
Ambiguity problem has not been addressed by previous research
Current research on keyword search over RDBMS uses indexes on a tuple level instead of an attribute level
Critique
Poorly written paper Typo mistakes.
In section 6.3, para 1, it should be “Section 4.3” instead of “Section 3.4”
In section 6.5 para 1, it should be “Figure 7(a)” instead of “Figure 8(a)”
In eqn1, it is not clear what is |HGk|, |SN| In eqn2, it is mentioned G is an aggregate function, but didn’t
specify it In Algorithm 2, the “SCORE” function used is not defined In Algorithm 2, the notion of “neighbor” is not defined
First, they say score function (section 4.3) does not take into account that “San Jose” matches two keywords and therefore should be assigned much higher score than “San Antonio”; then in the next section, they claim that normalization factor |SN|2 takes this problem into consideration.
Questions??