the similarity graph: analyzing database access patterns within a company phd preliminary...
TRANSCRIPT
![Page 1: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/1.jpg)
The Similarity Graph: Analyzing Database Access Patterns Within A
Company PhD Preliminary Examination
Robert Searles
CommitteeJohn Cavazos and Stephan Bohacek
University of DelawareComputer and Information Sciences
In collaboration with JP Morgan Chase & Co.
Computer and Information Sciences
![Page 2: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/2.jpg)
Computer and Information Sciences 2
![Page 3: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/3.jpg)
What are Many Employees Doing?
• SQL queries!
• Some employees access the database
• Queries reveal information about employees
Computer and Information Sciences 3
![Page 4: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/4.jpg)
Motivation
• Workforce efficiency
• Optimize Database
• Redundant work?
• New collaborations?
Computer and Information Sciences 4
![Page 5: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/5.jpg)
Solution
• Find computationally friendly representations of queries– Directed graphs (trees)
• Graph similarity techniques and Visualization– Find relationships between users– Find relationships between business units
Computer and Information Sciences 5
![Page 6: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/6.jpg)
What do SQL Queries Look Like?
• A query is a string of text
• SELECT <clause> FROM <clause> WHERE <clause>
Computer and Information Sciences 6
![Page 7: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/7.jpg)
SQL Query Example 1
• SELECT A, B, C AS c, function1(D) AS d, E AS e, F AS f FROM table1_name WHERE table1_name.c1 ≤ ’STR’ and table1_name.c1 < ’STR’ AND (G=’STR’ AND C IN (’STR’, ’STR’ ))
Computer and Information Sciences 7
![Page 8: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/8.jpg)
SQL Query Example 2
• SELECT A, H, E, C, F, G, I, D FROM table1_name WHERE table1_name.c1 ≤ ’STR’ and table1_name.c1 < ’STR’ AND (I>function2(p1, ’STR’) AND I≤
function2(p1, ’STR’))
Computer and Information Sciences 8
![Page 9: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/9.jpg)
Similarity Between SQL Queries
• SELECT A, B, C AS c, function1(D) AS d, E AS e, F AS f FROM table1_name WHERE table1_name.c1 ≤ ’STR’ and table1_name.c1 < ’STR’ AND (G=’STR’ AND C IN (’STR’, ’STR’ ))
• SELECT A, H, E, C, F, G, I, D FROM table1_name WHERE table1_name.c1 ≤ ’STR’ and table1_name.c1 < ’STR’ AND (I>function2(p1, ’STR’) AND I≤
function2(p1, ’STR’))
Computer and Information Sciences 9
![Page 10: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/10.jpg)
Representing SQL Queries
• Represent as graphs (parse trees)– No cycles
• Graphs are a more powerful representation– Similarity calculations– Structural relationships
• Graphs are visually appealing
Computer and Information Sciences 10
![Page 11: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/11.jpg)
SQL Parse Tree Example
Computer and Information Sciences 11
SELECT A,B FROM C
![Page 12: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/12.jpg)
Weisfeiler-Lehman
• Modified to work on ordered trees
– Relabel trees according to global alphabet
– Look for common labels and/or structures
– Perform subtree analysis on pairs of trees
Computer and Information Sciences 12
![Page 13: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/13.jpg)
Global Alphabet• T_SELECT 1• SELECT 2• T_COLUMN_LIST 3• T_FROM 4• T_SELECT_COLUMN 5• , 6• FROM 7• A 8• B 9• C 10
Computer and Information Sciences 13
![Page 14: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/14.jpg)
SELECT A,B FROM C
Computer and Information Sciences 14
![Page 15: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/15.jpg)
Example
• SELECT A, B, C AS c, function1(D) AS d, E AS e, F AS f FROM table1 name WHERE table1 name.c1 ≤ ’STR’ and table1 name.c1 < ’STR’ AND (G=’STR’ AND C IN (’STR’, ’STR’ ))
• SELECT A, H, E, C, F, G, I, D FROM table1 name WHERE table1 name.c1 ≤ ’STR’ and table1 name.c1 < ’STR’ AND (I>function2(p1, ’STR’) AND I≤
function2(p1, ’STR’))
Computer and Information Sciences 15
![Page 16: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/16.jpg)
Computer and Information Sciences 16
![Page 17: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/17.jpg)
Tree Similarity Algorithm
• Modified version of Weisfeiler-Lehman graph kernel• Walk over nodes in trees T and T’
– For each pair of nodes• Count number of matching subtrees (using
labels)• For a match, multiply by the height of the
subtree
• NormalizeComputer and Information
Sciences 17
![Page 18: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/18.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 18
Subtree Count = 0
![Page 19: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/19.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 19
Subtree Count = 0
![Page 20: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/20.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 20
Subtree Count = 0
![Page 21: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/21.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 21
Subtree Count = 0
![Page 22: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/22.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 22
Subtree Count = 2
![Page 23: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/23.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 23
Subtree Count = 2
![Page 24: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/24.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 24
Subtree Count = 2
![Page 25: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/25.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 25
Subtree Count = 2
![Page 26: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/26.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 26
Subtree Count = 2
![Page 27: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/27.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 27
Subtree Count = 3
![Page 28: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/28.jpg)
Tree Similarity Algorithm
Computer and Information Sciences 28
Subtree Count = 3
![Page 29: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/29.jpg)
User Similarity
Computer and Information Sciences 29
![Page 30: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/30.jpg)
User Similarity and Betweenness Centrality
• Compare collections of queries from each individual
• User Similarity– Find strong similarities between pairs of
individuals
• Betweenness Centrality– Find individuals who share commonalities with
many other users
Computer and Information Sciences 30
![Page 31: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/31.jpg)
User Similarity
• Let M, N denote the number of queries submitted by Users A, B respectively.
• Note that we calculate (sim(A B) + sim(B A))/2 in order to obtain a symmetrical matrix.
Computer and Information Sciences 31
![Page 32: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/32.jpg)
User Similarity
Computer and Information Sciences 32
![Page 33: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/33.jpg)
User Similarity
Computer and Information Sciences 33
![Page 34: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/34.jpg)
User Similarity
Computer and Information Sciences 34
![Page 35: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/35.jpg)
User Similarity
Computer and Information Sciences 35
![Page 36: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/36.jpg)
User Similarity
• Create similarity matrix
• Symmetrical
• Visualize
Computer and Information Sciences 36
![Page 37: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/37.jpg)
Betweenness Centrality
• How central a node is in a graph/network
• Tells us which users are similar to many other users– Versatile employee– Common underlying element
Computer and Information Sciences 37
![Page 38: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/38.jpg)
Workflow
Computer and Information Sciences 38
![Page 39: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/39.jpg)
Experimental Results
• 2 datasets (provided by JP Morgan Chase & Co.)
• Anonymized SQL queries
• Visualization– User similarity– Betweenness centrality– Heat Map
Computer and Information Sciences 39
![Page 40: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/40.jpg)
Dataset 1
• 49,655 queries (provided by JP Morgan Chase & Co.)
• 427 users
• Organized into 22 business units
Computer and Information Sciences 40
![Page 41: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/41.jpg)
Computer and Information Sciences 41
![Page 42: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/42.jpg)
Computer and Information Sciences 42
![Page 43: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/43.jpg)
Dataset 2
• 787,732 queries (provided by JP Morgan Chase & Co.)
• 92 users
• Organized into 6 business units
Computer and Information Sciences 43
![Page 44: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/44.jpg)
Computer and Information Sciences 44
![Page 45: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/45.jpg)
Heat Map
• Shows all the similarities uncovered in dataset 2
• Users sorted according to business unit
• Red = similar, Yellow = not similar
Computer and Information Sciences 45
![Page 46: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/46.jpg)
Computer and Information Sciences 46
![Page 47: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/47.jpg)
Challenges
• Volume of data– 49,655 x 49,655 = 10GB approx.– 787,732 x 787,732 = 2.5TB approx.
• Computational cost– User similarity is quadratic in the number of
queries.– Calculating user similarity matrix is n4
Computer and Information Sciences 47
![Page 48: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/48.jpg)
Acceleration
• Compute user similarity in parallel– No data dependencies
• If using accelerator, compute tree similarities in parallel
• OpenMP
Computer and Information Sciences 48
![Page 49: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/49.jpg)
Experimental Setup
• Machine: 16 GB of memory and 2x AMD Opteron 6320 CPUs, 8 cores per CPU clocked at 1.4 GHz
• Dataset 1: 49,655 queries
• Dataset 2: 787,732 queries
Computer and Information Sciences 49
![Page 50: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/50.jpg)
Acceleration
Computer and Information Sciences 50
![Page 51: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/51.jpg)
Contributions
• Created an algorithm that was used to measure the similarity between employees in a workforce (JP Morgan)
• Designed a scalable, encoding-based algorithm to measure the similarity between SQL queries
• Used these algorithms to create a visual representation of employee similarity across a workforce
Computer and Information Sciences 51
![Page 52: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/52.jpg)
Conclusion
• Developed a framework for calculating similarity between users/analysts
• Discovered similarities in the data
• Found similar users in different parts of the company
Computer and Information Sciences 52
![Page 53: The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0141a28abf838ccd72e/html5/thumbnails/53.jpg)
Acknowledgement
• Special thanks to JP Morgan Chase & Co.
• Collaboration on real-world problem
• Provided anonymized data
Computer and Information Sciences 53