discovering interesting subsets using statistical analysis

Discovering Interesting Subsets Using Statistical Analysis

Maitreya Natu and Girish K. Palshikar

Tata Research Development and Design Centre (TRDDC)

Pune, MH, India, 411013

{maitreya.natu, gk.palshikar}@tcs.com

Concept of Interesting Subsets

• In many real-life applications, it is important to automatically discover subsets of records that are interesting with respect to a given measure
– Database of customer support tickets: subsets of tickets that have very high or very low service time
– Database of employee satisfaction survey: subsets of employees that have a very high or very low satisfaction index
– Database of sales orders: subsets of orders that took a very long or very short time to fulfill

Insights from Interesting Subsets

• Interesting subsets provide insights for improving business processes, e.g.,
– Identification of
  • bottlenecks
  • areas of improvement
  • ways to increase per-person productivity
– What-if/Impact analysis
  • Finding the most effective way to improve the overall service time by x%

Interesting Subset Discovery (ISD) vs. Other Related Work

• Anomaly detection
– ISD focuses on finding interesting subsets, rather than individual interesting records

• Top-k heavy hitter analysis
– In top-k heavy hitter analysis, each record in the top-k subset has an unusual (high or low) value for the measure
– We wish to identify common characteristics of the records in the interesting subsets (rather than individual interesting records)

Application to the Domain of Customer Support Tickets

• Attributes:
– Timestamp-begin, Timestamp-end, Priority, Location, Resource, Problem, Solution, Solution-provider, etc.
– Service-time

• Discovery of interesting subsets of tickets that have very high service times, as compared to the rest of the tickets

Two Central Questions

• How to construct subsets of records?

• How to measure interestingness of records?

How to Construct Subsets of Records?

• SQL-like SELECT commands provide an intuitive way for the end-user to characterize and understand a subset of records

• We systematically explore subsets of records of a given table by increasingly refining the condition part of the SELECT command


• Attributes of customer support database records:
– PR (Priority), CT (Category), AC (Affected City)

• Domain of each attribute
– D_PR = {L, M, H}, D_CT = {A, B}, D_AC = {X}

• A descriptor {(PR, L), (AC, ‘New York’)} corresponds to the subset of records selected using
– SELECT * FROM D WHERE PR = L AND AC = ‘New York’

• The level of a descriptor is defined as the number of attributes in a descriptor
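The mapping from a descriptor to its SELECT command can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `descriptor_to_sql`, `level` and `select` are hypothetical, a descriptor is assumed to be a plain dictionary from attribute name to value, and the SQL string is built by simple formatting (fine for illustration, unsafe for untrusted input).

```python
from typing import Dict, List

Descriptor = Dict[str, str]  # attribute name -> required value (assumed representation)

def descriptor_to_sql(descriptor: Descriptor, table: str = "D") -> str:
    """Render a descriptor as the SELECT command that picks its subset."""
    conditions = " AND ".join(
        f"{attr} = '{value}'" for attr, value in sorted(descriptor.items())
    )
    return f"SELECT * FROM {table} WHERE {conditions}"

def level(descriptor: Descriptor) -> int:
    """The level of a descriptor is the number of attributes it constrains."""
    return len(descriptor)

def select(records: List[dict], descriptor: Descriptor) -> List[dict]:
    """In-memory equivalent of the SELECT: keep records matching every pair."""
    return [r for r in records if all(r.get(a) == v for a, v in descriptor.items())]

d = {"PR": "L", "AC": "New York"}
sql = descriptor_to_sql(d)
rows = select(
    [{"PR": "L", "AC": "New York"}, {"PR": "H", "AC": "New York"}], d
)
```

Refining a descriptor by adding one more attribute-value pair raises its level by one and can only shrink the selected subset.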

How to Measure Interestingness of Records?

• A : subset of database D
• A’ : D – A, records in D that are not in A
• Ф(A) : set of measure values for records in A
• Ф(A’) : set of measure values for records in A’
• We say A is an interesting subset if Ф(A) is statistically different from Ф(A’)
• E.g.: In the customer support database, a subset A would be interesting if the service times of tickets in A are statistically very different from the service times of the rest of the tickets, A’


• More formally, A is an interesting subset of D if the probability distribution of the values in Ф(A) is very different from that of the values in Ф(A’)

• We use statistical hypothesis tests (Student’s t-test) to measure interestingness

• Note that we focus on interestingness of subsets of tickets rather than individual tickets themselves

Student’s t-test

• Student’s t-test makes a null hypothesis that the means of two sets do not differ significantly

• Let X and Y be the two sets of numbers, of sizes n1 and n2

• The t-statistic for the unpaired sets X and Y assumes unequal variance and unequal sizes and tests whether the means of the two sets are statistically different

• The denominator of the t-statistic is a measure of the variability of the data and is called the standard error of the difference

• The t-test then calculates a p-value, which is the probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true

• If the p-value is below a threshold, the null hypothesis is rejected
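The formula these bullets refer to is not present in this copy of the slides. For unpaired samples with unequal variances and unequal sizes, the standard form is Welch's t-statistic, given here for reference; that it is exactly the formula shown on the original slide is an assumption. The symbols \bar{X}, \bar{Y} (sample means) and s_X^2, s_Y^2 (sample variances) are introduced here and are not from the slides.

```latex
t = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_X^{2}}{n_1} + \dfrac{s_Y^{2}}{n_2}}}
```

The square root in the denominator is the standard error of the difference between the two means.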


• For each subset of tickets A and its complement A’, we run the Student’s t-test on their performance metric values and compute a t-value and a p-value

• The t-value is positive if the mean of the first subset is larger than that of the second subset, and negative if it is smaller

• The p-value is the probability of seeing a difference at least this large if A and A’ actually had the same mean; the smaller the p-value, the stronger the evidence that A is statistically different from its complement A’
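As a rough sketch of this step, the following Python computes a t-value and p-value for a subset A against its complement A’ using only the standard library. The function names, the `service_time` field and the toy tickets are assumptions made for illustration. The p-value here uses a normal approximation to the t distribution, which is only reasonable for large subsets; a library routine such as `scipy.stats.ttest_ind(x, y, equal_var=False)` would give the exact Welch p-value.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t_test(x, y):
    """Return (t, p) for two unpaired samples with unequal variances.

    p is two-sided and comes from a normal approximation (an assumption of
    this sketch). Each sample needs at least two values and some variability.
    """
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    t = (mean(x) - mean(y)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p

def interestingness(records, in_subset, measure="service_time"):
    """Compare the measure values of subset A with those of its complement."""
    a = [r[measure] for r in records if in_subset(r)]
    a_comp = [r[measure] for r in records if not in_subset(r)]
    return welch_t_test(a, a_comp)

# Toy data: high-priority tickets take much longer than low-priority ones.
tickets = (
    [{"PR": "H", "service_time": s} for s in (40, 42, 45, 41, 44, 43)]
    + [{"PR": "L", "service_time": s} for s in (10, 12, 11, 13, 9, 10)]
)
t, p = interestingness(tickets, lambda r: r["PR"] == "H")
t_rev, p_rev = interestingness(tickets, lambda r: r["PR"] == "L")
```

Swapping the roles of A and A’ flips the sign of the t-value and leaves the p-value unchanged, which is why the sign is what distinguishes unusually high subsets from unusually low ones.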

How to Deal with the Large Search Space?

• Search space of all attribute combinations can be very large in real data sets

• We present three heuristics to prune the search space
– Size heuristic
– Goodness heuristic
– p-prediction heuristic


[Figure: search-space lattice. Labels: Level 1, Stage 1, Stage 2, Stage 3. Nodes: Priority = L, Priority = M, Priority = H, Priority = I, followed by all pairwise and three-way combinations of these four conditions.]


• Size heuristic
– Subsets of very small size can be noisy, leading to incorrect inference of interesting subsets
– We apply a threshold Ms and do not explore subsets with size less than Ms


• Goodness heuristic
– In the case of customer support tickets we are interested in identifying tickets with large service times
– The set of tickets with service time significantly smaller than the rest of the tickets can be pruned
– Prune a set if the t-test result has a t-value < 0 and p-value < Mg


• p-prediction heuristic
– A level k subset is built from two level k-1 subsets
– We observed that if two level k-1 subsets are statistically very different from each other, then the level k subset built from them is likely to be less different from its complement
– The heuristic prevents the combination of two subsets that are statistically very different, where the statistical difference is measured by the t-test
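The three pruning heuristics can each be read as a simple yes/no check. The sketch below states them as Python predicates; the function names are hypothetical, and the parameters `m_s`, `m_g` and `m_p` stand for the thresholds Ms, Mg and Mp, whose concrete values are not given in the slides.

```python
def passes_size(subset_size: int, m_s: int) -> bool:
    """Size heuristic: do not explore subsets with fewer than Ms records."""
    return subset_size >= m_s

def passes_goodness(t: float, p: float, m_g: float) -> bool:
    """Goodness heuristic: prune a subset whose measure is significantly
    smaller than the rest, i.e. t-value < 0 and p-value < Mg."""
    return not (t < 0 and p < m_g)

def may_combine(p_mutual: float, m_p: float) -> bool:
    """p-prediction heuristic: do not combine two level k-1 subsets whose
    mutual p-value is below Mp (they are statistically very different)."""
    return p_mutual >= m_p

# Example thresholds, chosen only for illustration.
keep_small = passes_size(3, m_s=5)                      # too small to explore
keep_low = passes_goodness(t=-4.2, p=0.001, m_g=0.05)   # significantly low
keep_high = passes_goodness(t=4.2, p=0.001, m_g=0.05)   # significantly high
combine = may_combine(p_mutual=0.30, m_p=0.01)          # not very different
```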

Accuracy of the p-prediction Heuristic

• For sets with p-values p1 and p2 and mutual p-value p12, the p-value prediction heuristic states that
– If p12 < Mp then p3 > min(p1, p2), where p3 is the p-value of the combined set

• Accuracy = percentage of subset pairs with p12 < Mp for which the p-value prediction property holds
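The accuracy measure can be computed directly from its definition. In this sketch the function name and the tuple layout `(p1, p2, p12, p3)` are assumptions, and the numbers are invented purely to exercise the code; they are not results from the study.

```python
def p_prediction_accuracy(pairs, m_p):
    """Percentage of pairs with p12 < m_p for which p3 > min(p1, p2).

    pairs: iterable of (p1, p2, p12, p3) tuples.
    Returns None when no pair has p12 < m_p, since accuracy is undefined.
    """
    qualifying = [(p1, p2, p3) for p1, p2, p12, p3 in pairs if p12 < m_p]
    if not qualifying:
        return None
    hits = sum(1 for p1, p2, p3 in qualifying if p3 > min(p1, p2))
    return 100.0 * hits / len(qualifying)

# Made-up p-values for three subset pairs.
example_pairs = [
    (0.01, 0.02, 0.001, 0.05),   # p12 < Mp and p3 > min(p1, p2): property holds
    (0.01, 0.02, 0.001, 0.005),  # p12 < Mp but p3 < min(p1, p2): property fails
    (0.01, 0.02, 0.5, 0.0001),   # p12 >= Mp: pair is not counted
]
accuracy = p_prediction_accuracy(example_pairs, m_p=0.01)
```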

Interesting Subset Discovery Algorithm

• Build a level k subset from two level k-1 subsets
– Two level k-1 subsets can be combined if they differ in exactly one attribute-value pair
– Check the p-prediction heuristic and skip the set if the mutual p-value of the two level k-1 sets is less than Mp

• Compute the interestingness of the subset by applying the t-test

• If the t-value and p-value of the set meet the threshold of statistical significance, store the set descriptor in the result set R

• Apply the size and goodness heuristics to level k sets to decide whether they should be used for building sets of subsequent levels

• Sort the result set R on increasing p-value
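The steps above can be put together in a compact level-wise sketch. This is one possible reading of the slides, not the authors' code. Assumptions: a descriptor is a frozenset of (attribute, value) pairs; a subset is stored in R when t > 0 and p < `alpha`; the p-value uses a normal approximation; and the thresholds `alpha`, `m_s`, `m_g`, `m_p` have illustrative default values.

```python
import math
from itertools import combinations
from statistics import NormalDist, mean, variance

def welch(x, y):
    """Welch's t-test -> (t, p); two-sided p from a normal approximation."""
    if len(x) < 2 or len(y) < 2:
        return 0.0, 1.0  # too few values to estimate a variance
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0:
        return 0.0, 1.0  # degenerate case: no variability at all
    t = (mean(x) - mean(y)) / se
    return t, 2.0 * (1.0 - NormalDist().cdf(abs(t)))

def isd(records, attrs, measure, alpha=0.05, m_s=5, m_g=0.05, m_p=0.01, max_level=2):
    def rows(desc):
        return [i for i, r in enumerate(records) if all(r[a] == v for a, v in desc)]

    def values(idx):
        return [records[i][measure] for i in idx]

    def score(idx):
        inside = set(idx)
        rest = [i for i in range(len(records)) if i not in inside]
        return welch(values(idx), values(rest))

    results = []
    # Level 1 candidates: one descriptor per attribute-value pair in the data.
    candidates = {frozenset([(a, r[a])]) for r in records for a in attrs}
    for k in range(1, max_level + 1):
        frontier = {}
        for desc in candidates:
            idx = rows(desc)
            t, p = score(idx)
            if t > 0 and p < alpha:  # significantly high: store in R
                results.append((p, t, sorted(desc)))
            # Size and goodness heuristics decide whether to build on desc.
            if len(idx) >= m_s and not (t < 0 and p < m_g):
                frontier[desc] = idx
        if k == max_level:
            break
        candidates = set()
        for d1, d2 in combinations(frontier, 2):
            merged = d1 | d2
            # Keep only merges that form a valid level k+1 descriptor.
            if len(merged) != k + 1 or len({a for a, _ in merged}) != k + 1:
                continue
            # p-prediction heuristic: skip mutually very different subsets.
            if welch(values(frontier[d1]), values(frontier[d2]))[1] < m_p:
                continue
            candidates.add(merged)
    return sorted(results, key=lambda r: r[0])  # increasing p-value

# Toy data: only tickets with PR = H and AC = X are slow.
records = []
for i in range(10):
    records.append({"PR": "H", "AC": "X", "st": 50 + i % 3})
    records.append({"PR": "H", "AC": "Y", "st": 10 + i % 3})
    records.append({"PR": "L", "AC": "X", "st": 10 + i % 3})
    records.append({"PR": "L", "AC": "Y", "st": 10 + i % 3})
found = isd(records, attrs=["PR", "AC"], measure="st")
```

On this toy data the level 2 descriptor {(PR, H), (AC, X)} ranks first, ahead of the two level 1 descriptors it refines, while PR = L and AC = Y are pruned by the goodness heuristic.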

Interesting Subset Discovery Algorithm (using sampling)

• The ISD algorithm reduces the search space using various heuristics, but for very large data sets (on the order of millions of records) the search space can still be very large

• We hence propose interesting subset discovery using sampling of the data set


• The algorithm is based on the following observations:
– A small number of interesting subsets that give major insight into the functioning and improvement of the system is preferred over a large number
  • Such sets give immediate action items for major improvement
– Out of all the interesting subsets, those that have a major impact on the overall system performance are of more importance
  • Such sets can provide insight into the areas of system improvement that can have maximum impact


• Proposed algorithm
– Take samples of the original data set and run the ISD algorithm on the samples
– Rank the results of all the runs based on the number of occurrences of a subset descriptor in the results of different samples
  • The larger the number of occurrences, the higher the rank
– If the rank is less than a predefined threshold, then remove the subset from the result
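The sampling variant can be sketched as a thin wrapper around any ISD routine. In this illustration `run_isd` is a hypothetical callable that takes a list of records and returns hashable subset descriptors; the rank of a descriptor is taken to be its occurrence count across samples; and the sample count, sample size and threshold are arbitrary example values.

```python
import random
from collections import Counter

def isd_with_sampling(records, run_isd, n_samples=10, sample_size=1000,
                      min_count=5, seed=0):
    """Run ISD on random samples and keep descriptors that recur often.

    Returns (descriptor, occurrences) pairs, most frequent first, dropping
    descriptors found in fewer than min_count samples.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        sample = rng.sample(records, min(sample_size, len(records)))
        # Count each descriptor at most once per sample.
        counts.update(set(run_isd(sample)))
    return [(d, c) for d, c in counts.most_common() if c >= min_count]

# Stand-in for a real ISD run: "A" is found in every sample, "B" in every other.
calls = []
def toy_isd(sample):
    calls.append(len(sample))
    return {"A", "B"} if len(calls) % 2 else {"A"}

ranked = isd_with_sampling(list(range(100)), toy_isd,
                           n_samples=4, sample_size=10, min_count=3)
```

Counting occurrences rather than averaging p-values keeps the wrapper independent of how the underlying ISD run scores its subsets.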

Experimental Results

• Experimental setup
– Data set of service request records of the IT division of a major financial institution
– 6000 records
– 7 attributes (PR, AC, ABU, AI, CS, CT, CD) with domain sizes (4, 23, 29, 48, 4, 9, 7) respectively
– Each record contains a Service_Time attribute as a performance metric

• We compare the results of the ISD algorithm with Brute Force and Random algorithms
– Brute Force: algorithm based on combinatorial search of the set space
– Random: randomly select k set descriptors for a level l and compute their interestingness. Perform multiple such runs.
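To get a feel for the size of the set space that a combinatorial search faces, the number of possible descriptors at each level can be computed from the domain sizes listed above. This sketch assumes that every combination of values over distinct attributes counts as a candidate descriptor, whether or not any record matches it; the function name is hypothetical.

```python
from itertools import combinations
from math import prod

# Domain sizes of PR, AC, ABU, AI, CS, CT, CD, as listed in the setup.
domain_sizes = [4, 23, 29, 48, 4, 9, 7]

def descriptors_at_level(sizes, k):
    """Number of level-k descriptors: choose k attributes, then one value each."""
    return sum(prod(chosen) for chosen in combinations(sizes, k))

per_level = {
    k: descriptors_at_level(domain_sizes, k)
    for k in range(1, len(domain_sizes) + 1)
}
total = sum(per_level.values())
```

For these seven attributes this gives 124 descriptors at level 1 and 70,559,999 over all levels. The latter is an upper bound on what a brute-force search would examine, since many combinations select no records.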

Experimental Results

• We successfully identified subsets of records with significantly high service time

• We ran the algorithm from level 1 to 5

• Level 1 results contain large subsets defined by a single attribute-value pair
– We were able to identify that tickets of a specific day of the week and tickets from a specific city have significantly higher service times than the rest of the tickets

• With higher levels we were able to do finer analysis of ticket properties

• The discovered interesting subsets provided insights for system improvement by identifying the areas that can have the highest impact on the overall service time of the system

Comparison with ISD_BF Algorithm: Number of Sets Explored

Comparison with ISD_R Algorithm: Coverage and Accuracy

• Coverage = % of ISD_R results covered by ISD
– 100% coverage

• Accuracy = average accuracy with which the ISD algorithm covers the ISD_R set descriptors
– 80% to 90% accuracy

Comparison of ISD_HS with ISD_H

[Figure panels: % of ISD_HS results that match ISD_H; % of ISD_H results that are covered by ISD_HS]

Conclusion

• We presented algorithms for discovery of interesting subsets from a given database of records with respect to a given quantitative measure

• We proposed various heuristics to prune the search space

• We presented an experimental evaluation of the proposed algorithm by applying it to the service request records of the IT division of a major financial institution

• The discovered interesting subsets prove to be very insightful for the given data set and provide insights for improvement of the business processes

Future Work

• Strengthening of heuristics to further reduce the search space

• Use the full power of SQL commands to systematically explore more complex subsets

• Application of the algorithm to real-life datasets from different domains
