Introduction to Graphs
15-111
Advanced Programming
7/31/2009 1
Advanced Programming Concepts/Data Structures
Ananda Gunawardena
Introduction
• Many real-world problems can be modeled
using graphs
– Airline Route Map
• What is the fastest way to get from Pittsburgh to St Louis?
• What is the cheapest way to get from Pittsburgh to St Louis?
– Electric Circuits
• Circuit elements - transistors, resistors, capacitors
• Is everything connected together?
– Depends on interconnections (wires)
• If this circuit is built, will it work?
– Depends on wires and objects they connect.
Graphs
• More applications
– Job Scheduling
• Interconnections indicate which jobs must be performed before others
• When should each task be performed?
• All these questions can be answered
using a mathematical structure named a
“graph”. We will answer the questions
– what are graphs?
– what are their basic properties?
Graph Definitions
• Graph
– A set of vertices (nodes) V = {v1, v2, …, vn}
– A set of edges (arcs) that connect the vertices, E = {e1, e2, …, em}
– Each edge ei is a pair (v, w) where v, w in V
– |V| = number of vertices (cardinality)
– |E| = number of edges
• Graphs can be
– directed (order of (v,w) matters)
– undirected (order of (v,w) doesn’t matter)
• Edges can be
– weighted (cost associated with the edge)
– e.g., neural network, airline route map (Vanguard Airlines)
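The definitions above can be sketched in Python. This is only an illustration, not part of the original slides: the names `vertices`, `edges`, and `has_edge`, the airport codes, and the edge costs are all made up for the example.

```python
# Hypothetical example data: three airports (vertices) and weighted,
# directed edges stored as (v, w) -> cost. All values are assumptions.
vertices = {"PIT", "ORD", "STL"}
edges = {
    ("PIT", "ORD"): 120,
    ("ORD", "STL"): 90,
    ("PIT", "STL"): 250,
}

def has_edge(edges, v, w, directed=True):
    """Return True if (v, w) is an edge.

    In a directed graph the order of (v, w) matters; in an undirected
    graph (w, v) counts as the same edge.
    """
    if (v, w) in edges:
        return True
    return (not directed) and (w, v) in edges

num_vertices = len(vertices)  # |V|
num_edges = len(edges)        # |E|

print(has_edge(edges, "ORD", "PIT"))                  # False
print(has_edge(edges, "ORD", "PIT", directed=False))  # True
```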
Graph Representation
• How do we represent a graph internally?
• Two ways
– Adjacency matrix
– Adjacency list
• Adjacency Matrix
– Use matrix entries to represent edges in the
graph
• Adjacency List
– Use an array of lists to represent edges in the
graph (we will discuss this later)
Adjacency Matrix
• Adjacency Matrix
– For each edge (v,w) in E, set A[v][w] = edge_cost
– Mark nonexistent edges with a logical infinity
• Cost of implementation
– O(|V|²) time for initialization
– O(|V|²) space
• ok for dense graphs
• unacceptable for sparse graphs
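A minimal sketch of the adjacency matrix idea, assuming vertices are numbered 0 to n-1 and edges arrive as (v, w, cost) triples. The function name `build_adjacency_matrix` and the sample edges are hypothetical.

```python
import math

INF = math.inf  # the "logical infinity" that marks a nonexistent edge

def build_adjacency_matrix(n, weighted_edges, directed=True):
    """Build an n x n cost matrix from (v, w, cost) triples."""
    # Initializing the matrix alone costs O(|V|^2) time and space.
    A = [[INF] * n for _ in range(n)]
    for v, w, cost in weighted_edges:
        A[v][w] = cost
        if not directed:
            A[w][v] = cost  # undirected: store the edge in both directions
    return A

A = build_adjacency_matrix(3, [(0, 1, 5), (0, 2, 2), (2, 1, 1)])
print(A[0][1])         # 5
print(A[1][0] == INF)  # True: no edge from 1 to 0 in the directed graph
```

With only 3 of 9 entries in use, even this small example shows why the matrix wastes space on sparse graphs.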
Adjacency List
• Adjacency List
– Ideal solution for sparse graphs
– For each vertex keep a list of all adjacent vertices
– Adjacent vertices are the vertices that are connected to the vertex
directly by an edge.
– Example
[Figure: a three-vertex example graph with its adjacency lists, List 0, List 1, and List 2]
Adjacency List
• The number of list nodes equals the number of edges (twice that for an undirected graph)
– O(|E|) space
• Space is also required to store the lists
– O(|V|) for |V| lists
• Note that the number of edges is at least ⌈|V|/2⌉
– assuming each vertex is in some edge
– Therefore disregard any O(|V|) term when O(|E|) is
present
• An adjacency list can be constructed in linear time (with respect to
the number of edges)
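The linear-time construction can be sketched as follows, assuming vertices are numbered 0 to n-1. The function name `build_adjacency_list` and the sample edges are hypothetical and are not taken from the slide's figure.

```python
def build_adjacency_list(n, edge_pairs, directed=True):
    """Build an array of n lists, one list of adjacent vertices per vertex."""
    adj = [[] for _ in range(n)]   # O(|V|) space for the |V| lists
    for v, w in edge_pairs:        # a single pass over the edges: O(|E|)
        adj[v].append(w)
        if not directed:
            adj[w].append(v)       # undirected: two list nodes per edge
    return adj

adj = build_adjacency_list(3, [(0, 1), (0, 2), (1, 2), (2, 1)])
print(adj)  # [[1, 2], [2], [1]]
```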
Breadth First Traversal
• Algorithm
– Start from any node in the graph
– Traverse its neighbors (nodes that are directly
connected to it) using some heuristic
– Next traverse the neighbors of the neighbors,
and so on, until some limit is reached or all the nodes
in the graph are visited
– Use a queue to perform the breadth first
traversal
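A minimal sketch of the queue-based traversal described above, assuming the graph is given as a dict of adjacency lists. The name `bfs` and the sample graph are hypothetical, and neighbors are visited in list order rather than by any particular heuristic.

```python
from collections import deque

def bfs(adj, start):
    """Return the vertices reachable from start in breadth-first order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()          # take the oldest discovered vertex
        order.append(v)
        for w in adj[v]:
            if w not in visited:     # enqueue each vertex only once
                visited.add(w)
                queue.append(w)
    return order

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs(adj, 0))  # [0, 1, 2, 3]
```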
Depth First Traversal
• Algorithm
– Start from any node in the graph
– Traverse deeper and deeper until a dead end is reached
– Backtrack and traverse other nodes that have
not been visited
– Use a stack to perform the depth first traversal
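A minimal sketch of the stack-based traversal, using the same assumed dict-of-lists graph format. The name `dfs` is hypothetical; backtracking happens implicitly when the stack pops an older entry after a dead end.

```python
def dfs(adj, start):
    """Return the vertices reachable from start in depth-first order."""
    visited = set()
    order = []
    stack = [start]
    while stack:
        v = stack.pop()              # take the most recently discovered vertex
        if v in visited:
            continue
        visited.add(v)
        order.append(v)
        # Push neighbors in reverse so the first-listed neighbor is explored first.
        for w in reversed(adj[v]):
            if w not in visited:
                stack.append(w)
    return order

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(dfs(adj, 0))  # [0, 1, 3, 2]
```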
Web Algorithms
• Search
– Google, MSN, Altavista
• Image search
– games
• Routing
• Distributed Computing
• Shortest Path Algorithms
– Google Maps, MapQuest
• Semantic Web
– XML metadata
• Etc.
Building a Search Engine
• Crawl the web
• Build a web index
• Then when we build/search, we may have to sort the index
– Google sorts more than 100 billion index
items
• Novel algorithms, novel data structures, distributed
computing
Web Crawlers
• Start with an initial page P0. Find URLs on P0 and
add them to a queue
• When done with P0, pass it to an indexing
program, get a page P1 from the queue and repeat
• Can be specialized (e.g. only look for email addresses)
• Issues
– Which page to look at next? (Special subjects, recency)
– How deep within a site do you go (depth search)?
– How frequently to visit pages?
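The crawl loop above can be sketched without any network access by treating the web as an in-memory dict from page name to outgoing links. Everything here is an assumption for illustration: the names `crawl`, `fetch_links`, `index_page`, and `toy_web`, the `max_pages` limit, and the `seen` set, which is added so that a page is never queued twice.

```python
from collections import deque

def crawl(fetch_links, start_url, index_page, max_pages=100):
    """Crawl breadth-first from start_url and return pages in visit order."""
    queue = deque([start_url])
    seen = {start_url}
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()            # get the next page from the queue
        for link in fetch_links(url):    # find URLs on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index_page(url)                  # hand the page to the indexing program
        crawled.append(url)
    return crawled

# A toy "web" standing in for real pages (hypothetical data).
toy_web = {
    "p0": ["p1", "p2"],
    "p1": ["p2", "p0"],
    "p2": ["p3"],
    "p3": [],
}
indexed = []
order = crawl(lambda u: toy_web.get(u, []), "p0", indexed.append)
print(order)  # ['p0', 'p1', 'p2', 'p3']
```

Using a queue makes this a breadth-first traversal of the link graph. Swapping in a priority queue would be one way to address the "which page next" issue.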
So, why Spider the Web?
• Refresh Collection by deleting dead links
– OK if index is slightly smaller
– Done every 1-2 weeks in best engines
• Finding new sites
– Respider the entire web
– Done every 2-4 weeks in best engines
Cost of Spidering
• Spider can (and does) run in parallel on
hundreds of servers
• Very high network connectivity (e.g. T3 line)
• Servers can migrate from spidering to query
processing depending on time-of-day load
• Running a full web spider takes days even with
hundreds of dedicated servers
Indexing
• Arrangement of data (data structure) to permit fast searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
• Sorting helps. Why?
– Permits binary search. About log₂ n probes into list
• log₂(1 billion) ~ 30
– Permits interpolation search. About log₂(log₂ n) probes (on average, for uniformly distributed keys)
• log₂(log₂(1 billion)) ~ 5
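Both searches can be sketched in a few lines. This is an illustration under assumptions: `binary_search` and `interpolation_search` are hypothetical helper names, the interpolation version only works on sorted numbers, and its log₂(log₂ n) behavior holds on average for roughly uniformly spaced keys.

```python
import math
from bisect import bisect_left

def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent."""
    i = bisect_left(sorted_items, target)
    return i if i < len(sorted_items) and sorted_items[i] == target else -1

def interpolation_search(sorted_nums, target):
    """Return the index of target in a sorted list of numbers, or -1."""
    lo, hi = 0, len(sorted_nums) - 1
    while lo <= hi and sorted_nums[lo] <= target <= sorted_nums[hi]:
        if sorted_nums[lo] == sorted_nums[hi]:
            return lo if sorted_nums[lo] == target else -1
        # Guess the position by assuming keys are spread evenly in [lo, hi].
        pos = lo + (target - sorted_nums[lo]) * (hi - lo) // (
            sorted_nums[hi] - sorted_nums[lo])
        if sorted_nums[pos] == target:
            return pos
        if sorted_nums[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

animals = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
print(binary_search(animals, "hen"))                        # 5
print(interpolation_search(list(range(0, 1000, 10)), 420))  # 42
print(round(math.log2(1e9)))                                # 30
print(round(math.log2(math.log2(1e9))))                     # 5
```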
Inverted Files
FILE
POS
1    A file is a list of words by position
10   - First entry is the word in position 1 (first word)
20   - Entry 4562 is the word in position 4562 (4562nd word)
30   - Last entry is the last word
36   An inverted file is a list of positions by word!

INVERTED FILE
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
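Inverting a file can be sketched in Python using the example text from this slide. The function name `invert` and the tokenization rule (lowercase, runs of letters and digits, punctuation dropped) are assumptions made for this example.

```python
import re
from collections import defaultdict

def invert(text):
    """Map each word to the list of 1-based positions where it occurs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    inverted = defaultdict(list)
    for position, word in enumerate(words, start=1):
        inverted[word].append(position)
    return dict(inverted)

text = """A file is a list of words by position
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!"""

inv = invert(text)
print(inv["word"])  # [14, 19, 24, 29, 35, 45]
print(inv["4562"])  # [21, 27]
```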
Inverted Files for Multiple Documents
LEXICON
WORD        NDOCS  PTR
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39
(each PTR points to that word's rows in the WORD INDEX)

WORD INDEX
DOCID  OCCUR  POS 1  POS 2  . . .
34     6      1  118  2087  3922  3981  5002
44     3      215  2291  3010
56     4      5  22  134  992
107    4      322  354  381  405
232    6      15  195  248  1897  1951  2192
677    1      481
713    3      42  312  802
566    3      203  245  287
67     1      132
. . .

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
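A minimal sketch of the lexicon and word index structure for several documents. The function name `build_index`, the dict-based layout, and the three tiny documents are hypothetical; a real engine would store postings on disk and keep only pointers in the lexicon.

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs maps doc_id -> text. Returns (lexicon, word_index)."""
    word_index = defaultdict(dict)  # word -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(tokens, start=1):
            word_index[word].setdefault(doc_id, []).append(pos)
    # Lexicon: words in sorted order, each with NDOCS; the word itself
    # serves as the "pointer" (key) into word_index.
    lexicon = {word: len(word_index[word]) for word in sorted(word_index)}
    return lexicon, dict(word_index)

# Hypothetical toy documents keyed by DOCID.
docs = {
    34: "jezebel met jezebel",
    44: "the city of jezreel",
    56: "jezebel in jezreel",
}
lexicon, word_index = build_index(docs)
print(lexicon["jezebel"])  # 2 (NDOCS)
for doc_id, positions in word_index["jezebel"].items():
    # Prints DOCID, OCCUR, then the position list:
    # 34 2 [1, 3]
    # 56 1 [1]
    print(doc_id, len(positions), positions)
```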
Ranking (Scoring) Hits
• Hits must be presented in some order
• What order?
– Relevance, recency, popularity, reliability, alphabetic?
• Some ranking methods
– Presence of keywords in title of document
– Closeness of keywords to start of document
– Frequency of keyword in document
– Link popularity (how many pages point to this one)
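The four ranking signals listed above could be combined into a single score, for instance as a weighted sum. This is a toy sketch and not how any real engine ranks: the function name `score_hit`, the weights, the formula for each signal, and the sample hits are all assumptions.

```python
def score_hit(title, body, keyword, inbound_links,
              weights=(3.0, 2.0, 1.0, 0.5)):
    """Combine four ranking signals into one score (weights are arbitrary)."""
    kw = keyword.lower()
    words = body.lower().split()
    w_title, w_close, w_freq, w_links = weights
    # Presence of the keyword in the title.
    in_title = 1.0 if kw in title.lower().split() else 0.0
    # Closeness to the start: 1 for the first word, falling off after that.
    closeness = 1.0 / (words.index(kw) + 1) if kw in words else 0.0
    # Frequency of the keyword relative to document length.
    frequency = words.count(kw) / len(words) if words else 0.0
    # Link popularity: number of pages pointing to this one.
    return (w_title * in_title + w_close * closeness
            + w_freq * frequency + w_links * inbound_links)

# Hypothetical hits: (title, body, number of inbound links).
hits = [
    ("Cooking", "a recipe that mentions a graph once", 0),
    ("Graph basics", "graph theory studies vertices and edges", 2),
]
ranked = sorted(hits, key=lambda h: score_hit(h[0], h[1], "graph", h[2]),
                reverse=True)
print([title for title, _, _ in ranked])  # ['Graph basics', 'Cooking']
```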