JJE: INEX XML Competition
Bryan Clevenger, James Reed, Jon McElroy
Introduction
Deal with the large size of the internet by using better categorization techniques
Goal: Optimize search time by grouping pages using clusters
Wikipedia is the data source
Problem
Take the Wikipedia data and create an algorithm that clusters it.
This reduces the search space for related information.
Solution
If documents share several links, they likely contain similar data.
Focused on the link data set. Link data:
39484 2039 4952 1029 39
1920 10233 30197
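One way to make "share several links" concrete is to count the link targets two documents have in common. The sketch below is only an illustration, not the authors' method: the class name `LinkOverlapSketch`, both method names, and the use of the Jaccard ratio are assumptions introduced here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not from the slides): measures how many links two documents share.
final class LinkOverlapSketch {

    /** Number of distinct link targets that appear in both documents' link lists. */
    static int sharedLinks(List<Integer> linksA, List<Integer> linksB) {
        Set<Integer> shared = new HashSet<>(linksA);
        shared.retainAll(new HashSet<>(linksB));
        return shared.size();
    }

    /** Jaccard overlap: shared links divided by all distinct links; 0.0 if both lists are empty. */
    static double jaccard(List<Integer> linksA, List<Integer> linksB) {
        Set<Integer> union = new HashSet<>(linksA);
        union.addAll(linksB);
        if (union.isEmpty()) {
            return 0.0;
        }
        return (double) sharedLinks(linksA, linksB) / union.size();
    }
}
```

For example, documents linking to {1, 2, 3, 4} and {3, 4, 5} share two links out of five distinct ones, giving an overlap of 0.4.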
Overall solution
Determine sub-communities in the graph using Max-Flow/Min-Cut community discovery
Heuristics are used to find relevant seeds
Max Flow – Min Cut
Edge Capacity – similar to edge weight. Represents the “amount” of information that can be pushed along an edge.
Flow – The sum, over the paths used from one node to another, of each path's minimum capacity.
Max Flow – Min Cut (cont.)
The flow between two nodes in the same cluster should be larger than the flow between two nodes in separate clusters.
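The flow definition and the cluster property above can be sketched in code. The slides do not say which max-flow algorithm was used, so this sketch assumes the BFS-based Edmonds-Karp variant of Ford-Fulkerson. The class `MinCutSketch` and its methods are hypothetical names. After the flow is computed, the nodes still reachable from the source in the residual graph form the source side of a minimum cut, which is one way to read off a community.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: Edmonds-Karp max flow on a nested-map capacity graph.
// Use one instance per source/sink pair, because maxFlow mutates the residual graph.
final class MinCutSketch {
    private final Map<Integer, Map<Integer, Integer>> residual = new HashMap<>();

    MinCutSketch(Map<Integer, Map<Integer, Integer>> capacity) {
        // Copy the capacities and add zero-capacity reverse edges where none exist.
        for (Map.Entry<Integer, Map<Integer, Integer>> from : capacity.entrySet()) {
            for (Map.Entry<Integer, Integer> to : from.getValue().entrySet()) {
                residual.computeIfAbsent(from.getKey(), k -> new HashMap<>())
                        .merge(to.getKey(), to.getValue(), Integer::sum);
                residual.computeIfAbsent(to.getKey(), k -> new HashMap<>())
                        .merge(from.getKey(), 0, Integer::sum);
            }
        }
    }

    /** Test helper (an assumption): adds an edge with the same capacity in both directions. */
    static void addUndirected(Map<Integer, Map<Integer, Integer>> g, int a, int b, int cap) {
        g.computeIfAbsent(a, k -> new HashMap<>()).put(b, cap);
        g.computeIfAbsent(b, k -> new HashMap<>()).put(a, cap);
    }

    /** BFS over edges with remaining capacity; stops early once stopAt is found (if non-null). */
    private Map<Integer, Integer> bfsParents(int source, Integer stopAt) {
        Map<Integer, Integer> parent = new HashMap<>();
        parent.put(source, source);
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty() && (stopAt == null || !parent.containsKey(stopAt))) {
            int u = queue.poll();
            Map<Integer, Integer> out =
                    residual.getOrDefault(u, Collections.<Integer, Integer>emptyMap());
            for (Map.Entry<Integer, Integer> e : out.entrySet()) {
                if (e.getValue() > 0 && !parent.containsKey(e.getKey())) {
                    parent.put(e.getKey(), u);
                    queue.add(e.getKey());
                }
            }
        }
        return parent;
    }

    /** Sums the bottleneck capacity of each augmenting path until none remains. */
    int maxFlow(int source, int sink) {
        if (source == sink) {
            throw new IllegalArgumentException("source and sink must differ");
        }
        int total = 0;
        while (true) {
            Map<Integer, Integer> parent = bfsParents(source, sink);
            if (!parent.containsKey(sink)) {
                return total;
            }
            int bottleneck = Integer.MAX_VALUE;
            for (int v = sink; v != source; v = parent.get(v)) {
                bottleneck = Math.min(bottleneck, residual.get(parent.get(v)).get(v));
            }
            for (int v = sink; v != source; v = parent.get(v)) {
                int u = parent.get(v);
                residual.get(u).merge(v, -bottleneck, Integer::sum);
                residual.get(v).merge(u, bottleneck, Integer::sum);
            }
            total += bottleneck;
        }
    }

    /** After maxFlow: nodes still reachable from source, i.e. the source side of a min cut. */
    Set<Integer> sourceSide(int source) {
        return new HashSet<>(bfsParents(source, null).keySet());
    }
}
```

On two triangles {1, 2, 3} and {4, 5, 6} with capacity-3 edges, joined by a single capacity-1 edge between 3 and 4, the flow from 1 to 2 is 6 while the flow from 1 to 6 is only 1, and the source side of the cut is {1, 2, 3}.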
Max Flow – Min Cut (cont.)
Max-Flow Community Discovery
Implementation
Implementation (Parsing)
Links parsed into a Graph.
Graph: HashMap<Integer, HashMap<Integer,Integer>> (Document Id to HashMap of Link Ids to Capacity)
Links structure was created:
Links[0] = 3244,2645,791
Links[1] = 10293,432,2,1230
...
Links[max] = 1012
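A minimal sketch of the parsing step, under assumptions the slides do not confirm: each input line is taken to be a document id followed by its link ids, separated by whitespace or commas, and each occurrence of a link adds 1 to that edge's capacity. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser: builds the nested-map graph (document id -> link id -> capacity).
final class LinkGraphParserSketch {

    /**
     * Assumed line format: "docId linkId linkId ...". Blank lines are skipped.
     * Assumed capacity rule: the number of times a link id appears for that document.
     */
    static Map<Integer, Map<Integer, Integer>> parse(List<String> lines) {
        Map<Integer, Map<Integer, Integer>> graph = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] tokens = trimmed.split("[\\s,]+");
            int docId = Integer.parseInt(tokens[0]);
            Map<Integer, Integer> out = graph.computeIfAbsent(docId, k -> new HashMap<>());
            for (int i = 1; i < tokens.length; i++) {
                out.merge(Integer.parseInt(tokens[i]), 1, Integer::sum);
            }
        }
        return graph;
    }
}
```

Using nested hash maps keeps both "which documents does this one link to" and "what is the capacity of this edge" as constant-time lookups on average, which is what a max-flow search over the graph needs.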
Implementation (Initialization of Community Seeds)
Using the Links structure, a percentage of the nodes with the most links is chosen as seeds
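The seed heuristic can be sketched as a sort by link count followed by a cut-off. The slides do not give the percentage or say whether incoming links count, so this sketch assumes outgoing links only and takes the fraction as a parameter. `SeedSelectionSketch` and `topSeeds` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical seed picker: keeps the given fraction of nodes with the most outgoing links.
final class SeedSelectionSketch {

    /** Returns node ids ordered by descending link count (ties by ascending id), truncated. */
    static List<Integer> topSeeds(Map<Integer, Map<Integer, Integer>> graph, double fraction) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be between 0 and 1");
        }
        List<Integer> ids = new ArrayList<>(graph.keySet());
        ids.sort((a, b) -> {
            int byLinks = Integer.compare(graph.get(b).size(), graph.get(a).size());
            return byLinks != 0 ? byLinks : Integer.compare(a, b);
        });
        // Round up so that any non-zero fraction of a non-empty graph yields at least one seed.
        int count = (int) Math.ceil(ids.size() * fraction);
        return new ArrayList<>(ids.subList(0, count));
    }
}
```

With four nodes that have 3, 1, 2, and 0 links and a fraction of 0.5, the two best-connected nodes are returned as seeds.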
Implementation (Finding Communities)
Idea, why it didn’t work? robots
Implementation (Visualization)
Walrus is an interactive 3D visualization tool that works on large directed graphs.
Input and output parsing. Clusters are grouped by color.
Results
The INEX links data was composed of 54,000 nodes and 15 million links
Average running time on a DELL Duo Core 2.0 GHz Pentium Laptop to retrieve one cluster was 5.9 hours
Cluster size is between 2,000 and 2,500 nodes
Results
Visual Images of clusters
Conclusion
It worked... kinda. Looks great! See pretty pictures.
Questions?