Discovering Interesting Subsets Using Statistical Analysis Maitreya Natu and Girish K. Palshikar Tata Research Development and Design Centre (TRDDC) Pune, MH, India, 411013 {maitreya.natu, gk.palshikar}@tcs.com

Post on 15-Jan-2016

Page 1: Discovering Interesting Subsets Using Statistical Analysis

Discovering Interesting Subsets Using Statistical Analysis

Maitreya Natu and Girish K. Palshikar
Tata Research Development and Design Centre (TRDDC)
Pune, MH, India, 411013
{maitreya.natu, gk.palshikar}@tcs.com

Page 2: Discovering Interesting Subsets Using Statistical Analysis

Concept of Interesting Subsets

• In many real-life applications, it is important to automatically discover subsets of records that are interesting with respect to a given measure
  – Database of customer support tickets: subsets of tickets that have very high or very low service time
  – Database of employee satisfaction survey: subsets of employees that have a very high or very low satisfaction index
  – Database of sales orders: subsets of orders that took a very long or very short time to fulfill

Page 3: Discovering Interesting Subsets Using Statistical Analysis

Insights from Interesting Subsets

• Interesting subsets provide insights for improving business processes, e.g.,
  – Identification of
    • bottlenecks
    • areas of improvement
    • ways to increase per-person productivity
  – What-if/impact analysis
    • Finding the most effective way to improve the overall service time by x%

Page 4: Discovering Interesting Subsets Using Statistical Analysis

ISD Vs. Other Related Work

• Anomaly detection
  – Interesting subset discovery (ISD) focuses on finding interesting subsets, rather than individual interesting records
• Top-k heavy hitter analysis
  – In top-k heavy hitter analysis, each record in the top-k subset has an unusual (high or low) value for the measure
  – We wish to identify common characteristics of the records in the interesting subsets (rather than individual interesting records)

Page 5: Discovering Interesting Subsets Using Statistical Analysis

Application on the domain of customer support tickets

• Attributes:
  – Timestamp-begin, Timestamp-end, Priority, Location, Resource, Problem, Solution, Solution-provider, etc.
  – Service-time
• Discovery of interesting subsets of tickets that have very high service times, as compared to the rest of the tickets

Page 6: Discovering Interesting Subsets Using Statistical Analysis

Two Central Questions

• How to construct subsets of records?
• How to measure interestingness of records?

Page 7: Discovering Interesting Subsets Using Statistical Analysis

How to Construct Subsets of Records?

• SQL-like SELECT commands provide an intuitive way for the end-user to characterize and understand a subset of records

• We systematically explore subsets of records of a given table by increasingly refining the condition part of the SELECT command

Page 8: Discovering Interesting Subsets Using Statistical Analysis

How to Construct Subsets of Records?

• Attributes of customer support database records:
  – PR (Priority), CT (Category), AC (Affected City)
• Domain of each attribute
  – D_PR = {L, M, H}, D_CT = {A, B}, D_AC = {X}
• A descriptor {(PR, L), (AC, 'New York')} corresponds to the subset of records selected using
  – SELECT * FROM D WHERE PR = L AND AC = 'New York'
• The level of a descriptor is defined as the number of attributes in the descriptor
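
The descriptor idea can be illustrated with a minimal Python sketch. It assumes records are plain dictionaries; the helper names `select`, `level`, and `to_sql`, the sample tickets, and the `service_time` field are hypothetical and only serve the illustration.

```python
from typing import Dict, List

Record = Dict[str, object]
Descriptor = Dict[str, object]  # attribute name -> required value


def select(records: List[Record], descriptor: Descriptor) -> List[Record]:
    """Return the records that match every (attribute, value) pair."""
    return [
        r for r in records
        if all(r.get(attr) == value for attr, value in descriptor.items())
    ]


def level(descriptor: Descriptor) -> int:
    """Level of a descriptor = number of attributes it constrains."""
    return len(descriptor)


def to_sql(descriptor: Descriptor, table: str = "D") -> str:
    """Render the descriptor as a SELECT string (display only, no escaping)."""
    conditions = " AND ".join(
        f"{attr} = '{value}'" for attr, value in sorted(descriptor.items())
    )
    return f"SELECT * FROM {table} WHERE {conditions}"


# Hypothetical tickets, invented for this example.
tickets = [
    {"PR": "L", "CT": "A", "AC": "New York", "service_time": 12.0},
    {"PR": "H", "CT": "B", "AC": "New York", "service_time": 3.5},
    {"PR": "L", "CT": "B", "AC": "Boston", "service_time": 7.0},
]
d = {"PR": "L", "AC": "New York"}
subset = select(tickets, d)
print(level(d), len(subset), to_sql(d))
```

Refining a descriptor by one more attribute-value pair can only shrink the selected subset, which is what makes the level-by-level exploration systematic.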

Page 9: Discovering Interesting Subsets Using Statistical Analysis

How to Construct Subsets of Records?

Page 10: Discovering Interesting Subsets Using Statistical Analysis

How to Measure Interestingness of Records?

• A : subset of database D
• A’ : D – A, records in D that are not in A
• Ф(A) : set of measure values for records in A
• Ф(A’) : set of measure values for records in A’
• We say A is an interesting subset if Ф(A) is statistically different from Ф(A’)
• E.g.: In the customer support database, a subset A would be interesting if the service times of tickets in A are statistically very different from the service times of the rest of the tickets in A’

Page 11: Discovering Interesting Subsets Using Statistical Analysis

How to Measure Interestingness of Records?

• More formally, A is an interesting subset of D if the probability distribution of the values in Ф(A) is very different from that of the values in Ф(A’)

• We use statistical hypothesis tests (Student’s t-test) to measure interestingness

• Note that we focus on interestingness of subsets of tickets rather than individual tickets themselves

Page 12: Discovering Interesting Subsets Using Statistical Analysis

Student’s t-test

• Student’s t-test makes a null hypothesis that the means of two sets do not differ significantly
• Let X and Y be the two sets of numbers of sizes n1 and n2
• The t-statistic for the unpaired sets X and Y assumes unequal variance and unequal sizes and tests whether the means of the two sets are statistically different
• The denominator of the t-statistic is a measure of the variability of the data and is called the standard error of difference
• The t-test then calculates a p-value, which is the probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true
• If the p-value is below a threshold, the null hypothesis is rejected
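
For reference, the standard unequal-variance form of the t-statistic (often called Welch's t-test) matches the description above and can be written as follows. The symbols for the sample means and sample variances are notation assumed here, not taken from the slide.

```latex
t = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_X^2}{n_1} + \dfrac{s_Y^2}{n_2}}},
\qquad
\bar{X} = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i,
\qquad
s_X^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1} \left(x_i - \bar{X}\right)^2
```

The mean and variance of Y are defined the same way with n2 in place of n1. The square root in the denominator is the standard error of difference.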

Page 13: Discovering Interesting Subsets Using Statistical Analysis

How to Measure Interestingness of Records?

• For each subset of tickets A and its complement A’, we run the Student’s t-test on their performance metric values and compute a t-value and a p-value
• The t-value is positive if the mean of the first subset is larger than that of the second subset, and negative if smaller
• The p-value is the probability of observing a difference at least as extreme if A and A’ did not actually differ; the smaller the p-value, the stronger the evidence that A is statistically different from its complement A’
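
A minimal sketch of this computation, using only the Python standard library, might look like the following. The function names and sample values are hypothetical. Note one deliberate simplification: the p-value here uses a normal approximation rather than the Student's t distribution, which is an assumption of this sketch and is only reasonable for fairly large subsets.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t(x: Sequence[float], y: Sequence[float]) -> float:
    """t-value for two unpaired samples with unequal variances and sizes."""
    # Standard error of difference (the denominator of the t-statistic).
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def approx_p_value(t: float) -> float:
    """Two-sided p-value via a normal approximation (assumption of this sketch)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def interestingness(a: Sequence[float], rest: Sequence[float]) -> Tuple[float, float]:
    """Return (t-value, p-value) for subset A against its complement A'."""
    t = welch_t(a, rest)
    return t, approx_p_value(t)


# Hypothetical service times for a subset A and its complement A'.
a = [10.0, 12.0, 11.0, 13.0, 12.0]
rest = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]
t, p = interestingness(a, rest)
print(f"t = {t:.2f}, p = {p:.4f}")
```

A positive t-value with a small p-value indicates that A has a noticeably larger mean service time than the rest of the tickets.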

Page 14: Discovering Interesting Subsets Using Statistical Analysis

How to Deal with the Large Search Space?

• The search space of all attribute combinations can be very large in real data sets
• We present three heuristics to prune the search space
  – Size heuristic
  – Goodness heuristic
  – p-prediction heuristic

Page 15: Discovering Interesting Subsets Using Statistical Analysis

How to Deal with the Large Search Space?

[Figure: search-space diagram for the Priority attribute (values L, M, H, I), labeled Level 1 and Stages 1 to 3: four single-value sets, six two-value combinations, and four three-value combinations.]

Page 16: Discovering Interesting Subsets Using Statistical Analysis

How to Deal with the Large Search Space?

[Figure: search-space diagram for the Priority attribute (values L, M, H, I), labeled Level 1 and Stages 1 to 3: four single-value sets, six two-value combinations, and four three-value combinations.]

Page 17: Discovering Interesting Subsets Using Statistical Analysis

How to Deal with the Large Search Space?

• Size heuristic
  – Subsets of very small size can be noisy, leading to incorrect inference of interesting subsets
  – We apply a threshold Ms and do not explore subsets with size less than Ms

Page 18: Discovering Interesting Subsets Using Statistical Analysis

How to Deal with the Large Search Space?

• Goodness heuristic
  – In the case of customer support tickets, we are interested in identifying tickets with large service times
  – Sets of tickets with service time significantly smaller than the rest of the tickets can be pruned
  – Prune a set if the t-test result has a t-value < 0 and p-value < Mg
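
The size and goodness heuristics both reduce to a simple predicate on a subset's size, t-value, and p-value. The sketch below is one possible reading; the function name and the example threshold values (30 for Ms, 0.05 for Mg) are assumptions, not values from the slides.

```python
def keep_for_expansion(size: int, t_value: float, p_value: float,
                       min_size: int, goodness_threshold: float) -> bool:
    """Decide whether a subset should be used to build higher-level subsets."""
    # Size heuristic: very small subsets are too noisy to explore (size < Ms).
    if size < min_size:
        return False
    # Goodness heuristic: the subset is significantly *better* (smaller
    # service time) than the rest, so it is not a candidate (t < 0, p < Mg).
    if t_value < 0 and p_value < goodness_threshold:
        return False
    return True


# Hypothetical thresholds, chosen only for illustration.
MS, MG = 30, 0.05
print(keep_for_expansion(100, 3.0, 0.001, MS, MG))   # large, slow subset
print(keep_for_expansion(100, -3.0, 0.001, MS, MG))  # large, fast subset
print(keep_for_expansion(10, 2.0, 0.01, MS, MG))     # too small
```

A subset with a negative t-value but a large p-value is kept, since it is not significantly faster than the rest.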

Page 19: Discovering Interesting Subsets Using Statistical Analysis

How to Deal with the Large Search Space?

• p-prediction heuristic
  – A level k subset is built from two level k-1 subsets
  – We observed that if two level k-1 subsets are statistically very different from each other, then the corresponding level k subset built from the two sets is likely to be less different from its complement
  – The heuristic prevents combination of two subsets that are statistically very different, where the statistical difference is measured by the t-test

Page 20: Discovering Interesting Subsets Using Statistical Analysis

Accuracy of the p-prediction Heuristic

• For sets with p-values p1 and p2 and mutual p-value p12, the p-value prediction heuristic states that
  – If p12 < Mp then p3 > min(p1, p2), where p3 is the p-value of the combined set
• Accuracy = % of the subset pairs with p12 less than Mp that satisfy the p-value prediction property
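
The accuracy measure can be sketched as a short function over observed p-values. The function name, the tuple layout `(p1, p2, p12, p3)`, and the sample numbers are hypothetical; the numbers are invented and do not come from the experiments.

```python
from typing import Iterable, Tuple

Case = Tuple[float, float, float, float]  # (p1, p2, p12, p3)


def p_prediction_accuracy(cases: Iterable[Case], mp: float) -> float:
    """Percent of pairs with p12 < mp for which p3 > min(p1, p2) holds."""
    applicable = [(p1, p2, p3) for p1, p2, p12, p3 in cases if p12 < mp]
    if not applicable:
        raise ValueError("no subset pair has a mutual p-value below the threshold")
    hits = sum(1 for p1, p2, p3 in applicable if p3 > min(p1, p2))
    return 100.0 * hits / len(applicable)


# Invented p-values for three subset pairs.
cases = [
    (0.01, 0.02, 0.001, 0.30),   # p12 < Mp and the property holds
    (0.01, 0.20, 0.002, 0.005),  # p12 < Mp but the property fails
    (0.03, 0.04, 0.50, 0.01),    # p12 >= Mp, so the pair is not counted
]
accuracy = p_prediction_accuracy(cases, mp=0.05)
print(accuracy)
```

Only pairs whose mutual p-value falls below Mp enter the denominator, because those are the pairs the heuristic would skip.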

Page 21: Discovering Interesting Subsets Using Statistical Analysis

Interesting Subset Discovery Algorithm

• Build a level k subset from two level k-1 subsets
  – Two level k-1 subsets can be combined if they differ in exactly one attribute-value pair
  – Check the p-prediction heuristic and skip the set if the mutual p-value of the two level k-1 sets is less than Mp
• Compute the interestingness of the subset by applying the t-test
• If the t-value and p-value of the set meet the threshold of statistical significance, store the set descriptor in the result set R
• Apply the size and goodness heuristics on level k sets to decide if the sets should be used further for building sets of subsequent levels
• Sort the result set R on increasing p-value
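
The first step above, building level k descriptors from pairs of level k-1 descriptors, can be sketched as follows. This covers candidate generation only, not the t-test or the pruning heuristics. The function name is hypothetical, and the check that rejects two values for the same attribute is an assumption of this sketch (such a conjunction would select no records).

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set, Tuple

Pair = Tuple[str, str]            # (attribute, value)
Descriptor = FrozenSet[Pair]      # conjunction of attribute-value pairs


def next_level(prev: Iterable[Descriptor]) -> Set[Descriptor]:
    """Combine level k-1 descriptors that differ in exactly one pair."""
    candidates: Set[Descriptor] = set()
    for a, b in combinations(list(prev), 2):
        merged = a | b
        # Differing in exactly one pair means the union has one extra pair.
        if len(merged) != len(a) + 1:
            continue
        # Assumption: skip descriptors that give one attribute two values.
        attributes = {attr for attr, _ in merged}
        if len(attributes) != len(merged):
            continue
        candidates.add(merged)
    return candidates


level1 = [
    frozenset({("PR", "L")}),
    frozenset({("PR", "H")}),
    frozenset({("AC", "X")}),
]
level2 = next_level(level1)
for descriptor in sorted(level2, key=sorted):
    print(sorted(descriptor))
```

Using frozensets makes duplicate candidates collapse automatically when the same level k descriptor can be reached from several pairs.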

Page 22: Discovering Interesting Subsets Using Statistical Analysis

Interesting Subset Discovery Algorithm (using sampling)

• The ISD algorithm reduces the search space using various heuristics, but for very large data sets (on the order of millions of records) the search space can still be very large

• We hence propose interesting subset discovery using sampling of the data set

Page 23: Discovering Interesting Subsets Using Statistical Analysis

Interesting Subset Discovery Algorithm (using sampling)

• The algorithm is based on the following observations:
  – A small number of interesting subsets that give major insight into functioning and improvement of the system is preferred over a large number
    • Such sets give immediate action items for major improvement
  – Out of all the interesting subsets, the subsets that have major impact on the overall system performance are of more importance
    • Such sets can provide insight into the areas of system improvement that can have maximum impact

Page 24: Discovering Interesting Subsets Using Statistical Analysis

Interesting Subset Discovery Algorithm (using sampling)

• Proposed algorithm
  – Take samples of the original data set and run the ISD algorithm on the samples
  – Rank the results of all the runs based on the number of occurrences of a subset descriptor in the results of different samples
    • The larger the number of occurrences, the higher the rank
  – If the rank is less than a predefined threshold, then remove the subset from the result
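
The sampling procedure can be sketched as below. The ISD run itself is passed in as a callable, so `run_isd` is a stand-in rather than the authors' implementation. Expressing the threshold as a minimum occurrence count is an assumption of this sketch, as are all names and the stub data.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple


def rank_by_occurrence(
    records: Sequence[object],
    run_isd: Callable[[List[object]], Iterable[Hashable]],
    n_samples: int,
    sample_size: int,
    min_count: int,
    seed: int = 0,
) -> List[Tuple[Hashable, int]]:
    """Run ISD on random samples and keep descriptors that recur often."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n_samples):
        sample = rng.sample(list(records), sample_size)
        # Count each descriptor at most once per sample.
        counts.update(set(run_isd(sample)))
    # most_common() orders by occurrences, highest first.
    return [(d, c) for d, c in counts.most_common() if c >= min_count]


# Stub ISD run: reports "PR=H" every time and "CT=B" only on the first call.
calls = {"n": 0}

def stub_isd(sample: List[object]) -> List[str]:
    calls["n"] += 1
    return ["PR=H", "CT=B"] if calls["n"] == 1 else ["PR=H"]

ranked = rank_by_occurrence(list(range(20)), stub_isd,
                            n_samples=5, sample_size=10, min_count=3)
print(ranked)
```

Descriptors that appear in only a few samples are dropped, which keeps the final result small and biased toward subsets that affect much of the data.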

Page 25: Discovering Interesting Subsets Using Statistical Analysis

Experimental Results

• Experimental setup
  – Data set of service request records of the IT division of a major financial institution
  – 6000 records
  – 7 attributes (PR, AC, ABU, AI, CS, CT, CD) with domain sizes (4, 23, 29, 48, 4, 9, 7) respectively
  – Each record contains a Service_Time attribute as a performance metric
• We compare the results of the ISD algorithm with Brute Force and Random algorithms
  – Brute Force: algorithm based on combinatorial search of the set space
  – Random: randomly select k set descriptors for a level l and compute their interestingness. Perform multiple such runs.

Page 26: Discovering Interesting Subsets Using Statistical Analysis

Experimental Results

• We successfully identified subsets of records with significantly high service time
• We ran the algorithm from level 1 to 5
• Level 1 results contain large subsets defined by a single attribute-value pair
  – We were able to identify that tickets from a specific day of the week, or from a specific city, have significantly higher service times than the rest of the tickets
• With higher levels we were able to do finer analysis of ticket properties
• The discovered interesting subsets provided insights for system improvement by identifying the improvement areas that can have the highest impact on the overall service time of the system

Page 27: Discovering Interesting Subsets Using Statistical Analysis

Comparison with ISD_BF algorithm: Number of Sets Explored

Page 28: Discovering Interesting Subsets Using Statistical Analysis

Comparison with ISD_R algorithm: Coverage and Accuracy

• Coverage = % of ISD_R results covered by ISD
  – 100% coverage
• Accuracy = Average accuracy with which the ISD algorithm covers the ISD_R set descriptors
  – 80% to 90% accuracy

Page 29: Discovering Interesting Subsets Using Statistical Analysis

Comparison of ISD_HS with ISD_H

[Plots: % of ISD_HS results that match with ISD_H; % of ISD_H results that are covered by ISD_HS]

Page 30: Discovering Interesting Subsets Using Statistical Analysis

Conclusion

• We presented algorithms for discovery of interesting subsets from a given database of records with respect to a given quantitative measure
• We proposed various heuristics to prune the search space
• We presented an experimental evaluation of the proposed algorithm by applying it to the service request records of the IT division of a major financial institution
• The discovered interesting subsets prove to be very insightful for the given data set and provide insights for improvement of the business processes

Page 31: Discovering Interesting Subsets Using Statistical Analysis

Future Work

• Strengthening of heuristics to further reduce the search space

• Use full power of SQL commands to systematically explore more complex subsets

• Application of the algorithm on real-life datasets from different domains