JJE: INEX XML Competition Bryan Clevenger James Reed Jon McElroy

Post on 16-Jan-2016

Page 1: JJE: INEX XML Competition Bryan Clevenger James Reed Jon McElroy

JJE: INEX XML Competition

Bryan Clevenger, James Reed, Jon McElroy

Page 2

Introduction

Deal with the large size of the internet by using better categorization techniques

Goal: Reduce search time by grouping pages into clusters

Wikipedia is the data source

Page 3

Problem

Take the Wikipedia data and create an algorithm that clusters it.

This reduces the search space for related information.

Page 4

Solution

If documents contain several of the same links, then they likely contain similar data.

Focused on the link data set.

Link data (sample):

39484 2039 4952 1029 39

1920 10233 30197

Page 5

Overall solution

Determine sub-communities in the graph using Max-Flow/Min-Cut community discovery

Heuristics used to find relevant seeds

Page 6

Max Flow – Min Cut

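Edge Capacity – similar to edge weight. Represents the “amount” of information that can be pushed along the edge.

Flow – The total amount that can be pushed from one node to another: the sum of the minimum (bottleneck) capacities of the paths used, where no path reuses capacity already taken by another.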
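
As a small worked illustration of the flow definition, the sketch below sums the bottleneck capacity of each path. It assumes the paths share no edges (our simplifying assumption; overlapping paths need a full max-flow algorithm), and the class and method names are hypothetical, not taken from the JJE code.

```java
import java.util.Arrays;
import java.util.List;

class PathFlowSketch {
    // Bottleneck of one path: the smallest edge capacity along it.
    // Assumes the path has at least one edge.
    static int bottleneck(List<Integer> capacitiesAlongPath) {
        int min = Integer.MAX_VALUE;
        for (int c : capacitiesAlongPath) {
            min = Math.min(min, c);
        }
        return min;
    }

    // Total flow over paths that share no edges: the sum of their bottlenecks.
    static int flowOverDisjointPaths(List<List<Integer>> paths) {
        int total = 0;
        for (List<Integer> path : paths) {
            total += bottleneck(path);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two edge-disjoint paths with capacities (3, 2) and (1, 4).
        List<List<Integer>> paths = Arrays.asList(
                Arrays.asList(3, 2),
                Arrays.asList(1, 4));
        System.out.println(flowOverDisjointPaths(paths)); // 2 + 1 = 3
    }
}
```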

Page 7

Max Flow – Min Cut (cont.)

The flow between two nodes in the same cluster should be larger than the flow between two nodes in separate clusters.
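
The sketch below illustrates this property on a toy graph. It uses the Edmonds-Karp variant of Ford-Fulkerson, which is our choice for the example and not necessarily what the JJE implementation used. The class name, method names, and the six-node graph (two tightly linked groups joined by one weak edge) are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MinCutSketch {
    // Residual capacities: node -> (neighbor -> remaining capacity).
    final Map<Integer, Map<Integer, Integer>> residual = new HashMap<>();

    void addEdge(int from, int to, int capacity) {
        residual.computeIfAbsent(from, k -> new HashMap<>()).merge(to, capacity, Integer::sum);
        // Reverse edge with zero capacity so pushed flow can be undone later.
        residual.computeIfAbsent(to, k -> new HashMap<>()).merge(from, 0, Integer::sum);
    }

    // Breadth-first search over edges with remaining capacity; returns parent links.
    private Map<Integer, Integer> bfs(int source) {
        Map<Integer, Integer> parent = new HashMap<>();
        parent.put(source, source);
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            Map<Integer, Integer> out = residual.getOrDefault(u, Collections.emptyMap());
            for (Map.Entry<Integer, Integer> e : out.entrySet()) {
                if (e.getValue() > 0 && !parent.containsKey(e.getKey())) {
                    parent.put(e.getKey(), u);
                    queue.add(e.getKey());
                }
            }
        }
        return parent;
    }

    // Edmonds-Karp: repeatedly augment along a shortest path with spare capacity.
    int maxFlow(int source, int sink) {
        if (source == sink) {
            throw new IllegalArgumentException("source and sink must differ");
        }
        int total = 0;
        while (true) {
            Map<Integer, Integer> parent = bfs(source);
            if (!parent.containsKey(sink)) {
                return total;
            }
            int push = Integer.MAX_VALUE;
            for (int v = sink; v != source; v = parent.get(v)) {
                push = Math.min(push, residual.get(parent.get(v)).get(v));
            }
            for (int v = sink; v != source; v = parent.get(v)) {
                int u = parent.get(v);
                residual.get(u).merge(v, -push, Integer::sum);
                residual.get(v).merge(u, push, Integer::sum);
            }
            total += push;
        }
    }

    // After maxFlow, the nodes still reachable from the source form its side of a minimum cut.
    Set<Integer> sourceSide(int source) {
        return new TreeSet<>(bfs(source).keySet());
    }

    // Hypothetical toy graph: nodes 1-3 and 4-6 are tightly linked, joined by one weak edge 3 -> 4.
    static MinCutSketch example() {
        MinCutSketch g = new MinCutSketch();
        g.addEdge(1, 2, 3);
        g.addEdge(1, 3, 3);
        g.addEdge(2, 3, 3);
        g.addEdge(3, 4, 1);
        g.addEdge(4, 5, 3);
        g.addEdge(4, 6, 3);
        g.addEdge(5, 6, 3);
        return g;
    }

    public static void main(String[] args) {
        MinCutSketch across = example();
        System.out.println(across.maxFlow(1, 6));    // 1: limited by the weak edge
        System.out.println(across.sourceSide(1));    // [1, 2, 3]
        System.out.println(example().maxFlow(1, 3)); // 6: both nodes in the same group
    }
}
```

In this toy graph the flow inside a group (6) is larger than the flow across groups (1), and the source side of the minimum cut recovers the first group.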

Page 8

Max Flow – Min Cut (cont.)

Page 9

Max-Flow Community Discovery

Page 10

Implementation

Page 11

Implementation (Parsing)

Links parsed into a Graph.

Graph: HashMap<Integer, HashMap<Integer, Integer>>

Document Id to HashMap of Link Ids to Capacity.

Links structure was created:

Links[0] = 3244, 2645, 791

Links[1] = 10293, 432, 2, 1230

...

Links[max] = 1012
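
A minimal sketch of how a Links array could be turned into the nested HashMap graph. The slide does not say how capacities were assigned, so giving each link occurrence a capacity of 1 is our assumption, and the class and method names are hypothetical.

```java
import java.util.HashMap;

class LinkGraphSketch {
    // links[docId] holds the ids of the documents that docId links to.
    static HashMap<Integer, HashMap<Integer, Integer>> build(int[][] links) {
        HashMap<Integer, HashMap<Integer, Integer>> graph = new HashMap<>();
        for (int docId = 0; docId < links.length; docId++) {
            HashMap<Integer, Integer> out = graph.computeIfAbsent(docId, k -> new HashMap<>());
            for (int linkId : links[docId]) {
                // Assumption: each occurrence of a link adds 1 to the edge capacity.
                out.merge(linkId, 1, Integer::sum);
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        // The first two rows of the Links structure shown on the slide.
        int[][] links = {
            {3244, 2645, 791},
            {10293, 432, 2, 1230}
        };
        HashMap<Integer, HashMap<Integer, Integer>> graph = build(links);
        System.out.println(graph.get(0).get(2645)); // 1
        System.out.println(graph.get(1).size());    // 4
    }
}
```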

Page 12

Implementation (Initialization of Community Seeds)

Using the Links structure, a percentage of the nodes with the most links is chosen as seeds
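
The seed selection could be sketched as follows. We assume "most links" means the number of entries in a node's Links row, and that the percentage is passed as a fraction. Both are assumptions, as are the class and method names; the slide does not give the actual percentage used.

```java
import java.util.ArrayList;
import java.util.List;

class SeedSketch {
    // Returns the top `fraction` of document ids, ranked by number of links (highest first).
    static List<Integer> pickSeeds(int[][] links, double fraction) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < links.length; i++) {
            ids.add(i);
        }
        // Sort by link count, descending; break ties by the smaller id.
        ids.sort((a, b) -> {
            int byCount = Integer.compare(links[b].length, links[a].length);
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        int count = (int) Math.ceil(fraction * ids.size());
        count = Math.max(0, Math.min(count, ids.size()));
        return new ArrayList<>(ids.subList(0, count));
    }

    public static void main(String[] args) {
        // Hypothetical toy data: link counts are 2, 1, 3, 0.
        int[][] links = { {1, 2}, {0}, {0, 1, 3}, {} };
        System.out.println(pickSeeds(links, 0.5)); // [2, 0]
    }
}
```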

Page 13

Implementation (Finding Communities)

Idea, why it didn’t work? robots

Page 14

Implementation (Visualization)

Walrus is an interactive 3D visualization tool that works on large directed graphs.

Input and output parsing. Clusters grouped by color.

Page 15

Results

The INEX links data was composed of 54,000 nodes and 15 million links

Average running time to retrieve one cluster on a Dell Duo Core 2.0 GHz Pentium laptop was 5.9 hours

Cluster size is between 2K and 2.5K nodes

Page 16

Results

Visual Images of clusters

Page 17

Conclusion

It worked... kinda. Looks great! See pretty pictures.

Page 18

References

[1] Inex 2009 mining track. http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp, October 2009.

[2] The standard maximum flow problem. http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=maxFlow, November 2009.

[3] Walrus - graph visualization tool. http://www.caida.org/tools/visualization/walrus, December 2009.

[4] Mark C. Chu-Carroll. Maximum flow and minimum cut. http://scienceblogs.com/goodmath/2007/08/maximum_flow_and_minimum_cut_1.php, December 2009.

[5] Ford-Fulkerson algorithm. http://en.wikipedia.org/wiki/FordFulkersos_algorithm, October 2009.

[6] Max-flow min-cut theorem. http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem, November 2009.

Page 19

Questions?

O really?