Building a Mini-Google: High Performance Computing in Ruby

Upload: railsconf

Post on 28-Jan-2015

Category:

Technology


Page 1: Building A Mini Google  High Performance Computing In Ruby

Building Mini-Google in Ruby @igrigorik #railsconf http://bit.ly/railsconf-pagerank

Building Mini-Google in Ruby

Ilya Grigorik

@igrigorik

Page 2: Building A Mini Google  High Performance Computing In Ruby

postrank.com/topic/ruby

The slides… Twitter My blog

Page 3: Building A Mini Google  High Performance Computing In Ruby

Ruby + Math
Optimization
PageRank
Indexing
Examples
Misc Fun

Page 4: Building A Mini Google  High Performance Computing In Ruby

PageRank
PageRank + Ruby
Indexing
Examples
Tools
+ Optimization

Page 5: Building A Mini Google  High Performance Computing In Ruby

Consume with care…
everything that follows is based on released / public domain info

Page 6: Building A Mini Google  High Performance Computing In Ruby

Search-engine graveyard
Google did pretty well…

Page 7: Building A Mini Google  High Performance Computing In Ruby

Search pipeline
50,000-foot view

Query: Ruby

Results

1. Crawl 2. Index 3. Rank

Page 8: Building A Mini Google  High Performance Computing In Ruby

Query: Ruby

Results

1. Crawl 2. Index 3. Rank

Bah / Fun / Interesting

Page 9: Building A Mini Google  High Performance Computing In Ruby

circa 1997-1998

CPU speed: 333 MHz
RAM: 32-64 MB
Index: 27,000,000 documents
Index refresh: once a month~ish
PageRank computation: several days

Laptop CPU: 2.1 GHz
VM RAM: 1 GB
1-million-page web: ~10 minutes

Page 10: Building A Mini Google  High Performance Computing In Ruby

Creating & Maintaining an Inverted Index
DIY and the gotchas within

Page 11: Building A Mini Google  High Performance Computing In Ruby

Building an Inverted Index

require 'set'

pages = {
  "1" => "it is what it is",
  "2" => "what is it",
  "3" => "it is a banana"
}

index = {}

pages.each do |page, content|
  content.split(/\s/).each do |word|
    if index[word]
      index[word] << page
    else
      # Wrap the page id in an Array: a String is not enumerable in Ruby 1.9+
      index[word] = Set.new([page])
    end
  end
end

# Resulting index:
{
  "it"     => #<Set: {"1", "2", "3"}>,
  "a"      => #<Set: {"3"}>,
  "banana" => #<Set: {"3"}>,
  "what"   => #<Set: {"1", "2"}>,
  "is"     => #<Set: {"1", "2", "3"}>
}

Page 13: Building A Mini Google  High Performance Computing In Ruby

Word => [Document]

Page 14: Building A Mini Google  High Performance Computing In Ruby

Querying the index

# query: "what is banana"
p index["what"] & index["is"] & index["banana"]
# > #<Set: {}>

# query: "a banana"
p index["a"] & index["banana"]
# > #<Set: {"3"}>

# query: "what is"
p index["what"] & index["is"]
# > #<Set: {"1", "2"}>
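The three queries above share one pattern: look up the set of pages for each term, then intersect the sets. A minimal sketch of that pattern, assuming a hypothetical helper named `search` (it is not part of the original slides) and the index contents shown earlier:

```ruby
require 'set'

# Index from the previous slides, written out literally.
INDEX = {
  "it"     => Set["1", "2", "3"],
  "a"      => Set["3"],
  "banana" => Set["3"],
  "what"   => Set["1", "2"],
  "is"     => Set["1", "2", "3"]
}

# Hypothetical helper: AND-query over whitespace-separated terms.
# A term that is missing from the index contributes an empty set.
def search(index, query)
  query.split.map { |term| index.fetch(term, Set.new) }.reduce(:&) || Set.new
end

p search(INDEX, "what is banana")
p search(INDEX, "a banana")
p search(INDEX, "what is")
```

The result is still a set, so it says which pages match but nothing about the order in which to show them.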

Page 17: Building A Mini Google  High Performance Computing In Ruby

What order?

[1, 2] or [2,1]

Page 18: Building A Mini Google  High Performance Computing In Ruby

Hmmm?

PDF, HTML, RSS?
Lowercase / Upcase?
Compact Index?
Stop words?
Persistence?
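Two of these gotchas, case and stop words, can be handled before a word ever reaches the index. A rough sketch, assuming a hypothetical `tokenize` helper and a deliberately tiny, made-up stop-word list (neither comes from the slides):

```ruby
require 'set'

# Assumed, illustrative stop-word list; a real one would be much longer.
STOP_WORDS = Set["a", "is", "it"]

# Hypothetical tokenizer: lowercase, keep runs of letters and digits, drop stop words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content).each { |word| index[word] << page }
  end
  index
end

pages = {
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a BANANA"
}

index = build_index(pages)
p index.keys.sort
p index["banana"]
```

Dropping stop words also makes the index more compact, at the cost of no longer being able to answer queries made only of those words.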

Page 20: Building A Mini Google  High Performance Computing In Ruby

Ferret is a high-performance, full-featured text search engine library written for Ruby

Page 21: Building A Mini Google  High Performance Computing In Ruby

require 'ferret'
include Ferret

index = Index::Index.new()

index << {:title => "1", :content => "it is what it is"}
index << {:title => "2", :content => "what is it"}
index << {:title => "3", :content => "it is a banana"}

index.search_each('content:"banana"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} "
end

> Score: 1.0, 3

Page 22: Building A Mini Google  High Performance Computing In Ruby

Hmmm?

Page 23: Building A Mini Google  High Performance Computing In Ruby

Classes in Ferret::Analysis:
Analyzer, AsciiLetterAnalyzer, AsciiLetterTokenizer, AsciiLowerCaseFilter, AsciiStandardAnalyzer, AsciiStandardTokenizer, AsciiWhiteSpaceAnalyzer, AsciiWhiteSpaceTokenizer, HyphenFilter, LetterAnalyzer, LetterTokenizer, LowerCaseFilter, MappingFilter, PerFieldAnalyzer, RegExpAnalyzer, RegExpTokenizer, StandardAnalyzer, StandardTokenizer, StemFilter, StopFilter, Token, TokenStream, WhiteSpaceAnalyzer, WhiteSpaceTokenizer

Classes in Ferret::Search:
BooleanQuery, ConstantScoreQuery, Explanation, Filter, FilteredQuery, FuzzyQuery, Hit, MatchAllQuery, MultiSearcher, MultiTermQuery, PhraseQuery, PrefixQuery, Query, QueryFilter, RangeFilter, RangeQuery, Searcher, Sort, SortField, TermQuery, TopDocs, TypedRangeFilter, TypedRangeQuery, WildcardQuery

Page 24: Building A Mini Google  High Performance Computing In Ruby

ferret.davebalmain.com/trac

Page 25: Building A Mini Google  High Performance Computing In Ruby

Ranking Results
0-60 with PageRank…

Page 26: Building A Mini Google  High Performance Computing In Ruby

Naïve: Term Frequency

index.search_each('content:"the brown cow"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} "
end

> Score: 0.827, 3
> Score: 0.523, 5
> Score: 0.125, 4

Relevance?

Term    Doc 3   Doc 5   Doc 4
the       4       3       5
brown     1       3       1
cow       1       4       1
Score     6      10       7

Page 27: Building A Mini Google  High Performance Computing In Ruby

Skew

Term    Doc 3   Doc 5   Doc 4
the       4       3       5
brown     1       3       1
cow       1       4       1
Score     6      10       7

Page 28: Building A Mini Google  High Performance Computing In Ruby

TF-IDF
Term Frequency * Inverse Document Frequency

Skew

Term    Doc 3   Doc 5   Doc 4
the       4       3       5
brown     1       3       1
cow       1       4       1

Total # of documents: 10

Term    # of docs
the       6
brown     3
cow       4

Score = TF * IDF

TF = # occurrences / # words
IDF = ln(# docs / # docs with W)

Page 29: Building A Mini Google  High Performance Computing In Ruby

TF-IDF

Total # of documents: 10
# words in document: 10

Doc #3 score for ‘the’: 4/10 * ln(10/6) = 0.204
Doc #3 score for ‘brown’: 1/10 * ln(10/3) = 0.120
Doc #3 score for ‘cow’: 1/10 * ln(10/4) = 0.092

Score = 0.204 + 0.120 + 0.092 = 0.416
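The arithmetic for document 3 can be reproduced in a few lines. This is a sketch using only the numbers on the slide; the `tf_idf` helper and the constant names are illustrative and do not reflect how Ferret computes its scores:

```ruby
# Numbers taken from the slide: 10 documents in total, 10 words in document 3.
TOTAL_DOCS   = 10
WORDS_IN_DOC = 10

# term => [occurrences in document 3, number of documents containing the term]
TERMS = {
  "the"   => [4, 6],
  "brown" => [1, 3],
  "cow"   => [1, 4]
}

def tf_idf(occurrences, words_in_doc, total_docs, docs_with_term)
  tf  = occurrences.to_f / words_in_doc
  idf = Math.log(total_docs.to_f / docs_with_term) # natural log, as on the slide
  tf * idf
end

scores = TERMS.map do |term, (occurrences, docs_with_term)|
  [term, tf_idf(occurrences, WORDS_IN_DOC, TOTAL_DOCS, docs_with_term)]
end.to_h

scores.each { |term, score| puts format("%-6s %.3f", term, score) }
puts format("total  %.3f", scores.values.sum)
```

Although ‘the’ appears four times, the logarithm of its document frequency keeps it from swamping the rarer terms the way it does with raw counts.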

Page 30: Building A Mini Google  High Performance Computing In Ruby

Frequency Matrix

        W1   W2   …   WN
Doc 1   15   23   …
Doc 2   24   12   …
…       …    …    …
Doc K

Size = N * K * size of Ruby object

Pages = N = 10,000
Words = K = 2,000
Ruby Object = 20+ bytes

Footprint = 384 MB

Ouch.

Page 31: Building A Mini Google  High Performance Computing In Ruby

NArray
http://narray.rubyforge.org/

NArray is a Numerical N-dimensional Array class (implemented in C)

NArray.new(typecode, size, ...)   # create new NArray. initialize with 0.
NArray.byte(size,...)             # 1 byte unsigned integer
NArray.sint(size,...)             # 2 byte signed integer
NArray.int(size,...)              # 4 byte signed integer
NArray.sfloat(size,...)           # single precision float
NArray.float(size,...)            # double precision float
NArray.scomplex(size,...)         # single precision complex
NArray.complex(size,...)          # double precision complex
NArray.object(size,...)           # Ruby object

Page 33: Building A Mini Google  High Performance Computing In Ruby

PageRank
the google juice

Links as votes

Problem: link gaming

Page 34: Building A Mini Google  High Performance Computing In Ruby

Random Surfer
powerful abstraction

Follow link from page he/she is currently on: P = 0.85
Teleport to a random location on the web: P = 0.15
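The random surfer is easy to simulate directly. The sketch below assumes a made-up four-page link structure (the page names echo the deck's figures, but the links are hypothetical) and counts how often the surfer lands on each page:

```ruby
# Tiny hypothetical web: page => pages it links to.
LINKS = {
  "N" => ["K"],
  "M" => ["K"],
  "K" => ["X"],
  "X" => ["K"]
}

FOLLOW_PROBABILITY = 0.85 # otherwise the surfer teleports (0.15)

def surf(links, steps, rng)
  pages  = links.keys
  visits = Hash.new(0)
  page   = pages.sample(random: rng)

  steps.times do
    out_links = links[page]
    page = if !out_links.empty? && rng.rand < FOLLOW_PROBABILITY
             out_links.sample(random: rng) # follow a link
           else
             pages.sample(random: rng)     # teleport anywhere
           end
    visits[page] += 1
  end

  # The fraction of time spent on each page approximates its PageRank.
  visits.transform_values { |count| count.to_f / steps }
end

frequencies = surf(LINKS, 100_000, Random.new(42))
frequencies.sort_by { |_, share| -share }.each do |page, share|
  puts format("%s %.3f", page, share)
end
```

Pages that many links point to, directly or indirectly, collect most of the visits. Pages with no in-links are reached only by teleportation.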

Page 35: Building A Mini Google  High Performance Computing In Ruby

Surfin’
rinse & repeat, ad nauseam

Follow link from page he/she is currently on.
Teleport to a random location on the web.

[Figure: pages K, N, and M]

Page 36: Building A Mini Google  High Performance Computing In Ruby

Surfin’
rinse & repeat, ad nauseam

On page P, clicks on link to K (P = 0.85)
On page K, clicks on link to M (P = 0.85)
On page M, teleports to X (P = 0.15)

Page 37: Building A Mini Google  High Performance Computing In Ruby

Analyzing the Web Graph
extracting PageRank

[Figure: web graph with pages N, M, K, X, labeled P = 0.6, P = 0.15, P = 0.20, P = 0.05]

Page 38: Building A Mini Google  High Performance Computing In Ruby

What is PageRank?
It’s a scalar!

Page 39: Building A Mini Google  High Performance Computing In Ruby

What is PageRank?
it’s a probability!

[Figure: web graph with pages N, M, K, X, labeled P = 0.6, P = 0.15, P = 0.20, P = 0.05]

Page 40: Building A Mini Google  High Performance Computing In Ruby

What is PageRank?
it’s a probability!

[Figure: web graph with pages N, M, K, X, labeled P = 0.6, P = 0.15, P = 0.20, P = 0.05]

Higher Pr, Higher Importance?

Page 41: Building A Mini Google  High Performance Computing In Ruby

Teleportation?
sci-fi fans, … ?

Page 42: Building A Mini Google  High Performance Computing In Ruby

Reasons for teleportation
enumerating edge cases

1. No in-links!
2. No out-links!
3. Isolated Web

[Figure: example graphs over pages N, M, K, X illustrating each case]

Page 43: Building A Mini Google  High Performance Computing In Ruby

Exploring Graphs
gratr.rubyforge.com

• Breadth First Search
• Depth First Search
• A* Search
• Lexicographic Search
• Dijkstra’s Algorithm
• Floyd-Warshall
• Triangulation and Comparability detection

require 'gratr/import'

dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

dg.directed?    # true
dg.vertex?(4)   # true
dg.edge?(2,4)   # true
dg.vertices     # [5, 6, 1, 2, 3, 4]

Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]
Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]

Page 44: Building A Mini Google  High Performance Computing In Ruby

Teleportation
probabilities

P(T) = 0.15 / # of pages

[Figure: five pages, each labeled P(T) = 0.03, since 0.15 / 5 = 0.03]

Page 45: Building A Mini Google  High Performance Computing In Ruby

PageRank: Simplified Mathematical Def’n
cause that’s how we roll

Assume the web is N pages big.
Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85.
Assume that the teleportation probability (E) is uniform.
Assume that you start on any random page (uniform distribution L):

$$L = \left(\tfrac{1}{N}, \ldots, \tfrac{1}{N}\right), \qquad T = \begin{bmatrix} 0.15/N \\ \vdots \\ 0.15/N \end{bmatrix}$$

Then after one step, the probability you’re on page X is:

$$L \cdot (sG + tE) = L \cdot (0.85 \cdot G + 0.15 \cdot E)$$
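A single step of this formula can be written out with plain arrays. The sketch below assumes the row convention (entry `g[i][j]` is the probability of moving from page i to page j, so each row sums to 1), treats E as the matrix whose entries are all 1/N, and uses a hypothetical three-page cycle:

```ruby
FOLLOW   = 0.85 # s: probability of following a link
TELEPORT = 0.15 # t: probability of teleporting

# One step of the surfer: next = L * (s*G + t*E), with every entry of E equal to 1/N.
# g[i][j] is the probability of moving from page i to page j (rows sum to 1).
def step(l, g)
  n = l.size
  (0...n).map do |j|
    (0...n).sum { |i| l[i] * (FOLLOW * g[i][j] + TELEPORT / n) }
  end
end

# Hypothetical 3-page web: 1 -> 2, 2 -> 3, 3 -> 1.
g = [
  [0.0, 1.0, 0.0],
  [0.0, 0.0, 1.0],
  [1.0, 0.0, 0.0]
]
l = Array.new(3, 1.0 / 3) # uniform starting distribution

p step(l, g)
```

Because the cycle is symmetric, the uniform distribution comes back unchanged. Asymmetric link structures shift probability toward well-linked pages with each step.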

Page 46: Building A Mini Google  High Performance Computing In Ruby

G = The Link Graph
ginormous and sparse

      1   2   …   …   N
1     1   0   …   …   0
2     0   1   …   …   1
…     …   …   …   …   …
…     …   …   …   …   …
N     0   1   …   …   1

Link Graph: a 0 in row 1, column N means no link from 1 to N

Huge!

Page 47: Building A Mini Google  High Performance Computing In Ruby

G as a dictionary
more compact…

{
  "1" => [25, 26],
  "2" => [1],
  "5" => [123, 2],
  "6" => [67, 1]
}

Page => Links to…

Page 48: Building A Mini Google  High Performance Computing In Ruby

Computing PageRank
the tedious way

Follow link from page he/she is currently on.

Teleport to a random location on the web.

Page K
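The tedious way amounts to repeating the surfer's step until the numbers stop changing (power iteration). A sketch over a dictionary-style graph, assuming a hypothetical helper `pagerank_by_iteration` and a three-page graph with the same link structure as the "All roads lead to K" example later in the deck:

```ruby
# Hypothetical link graph in dictionary form: page => pages it links to.
GRAPH = {
  1 => [2, 3],
  2 => [3],
  3 => [3]
}

def pagerank_by_iteration(graph, follow: 0.85, iterations: 100)
  pages = graph.keys
  n     = pages.size
  rank  = pages.to_h { |page| [page, 1.0 / n] } # start anywhere, uniformly

  iterations.times do
    # Every page first receives its teleportation share...
    nxt = pages.to_h { |page| [page, (1 - follow) / n] }
    # ...then each page passes on 85% of its rank, split across its out-links.
    graph.each do |page, out_links|
      # Dead ends spread their rank uniformly (an assumption of this sketch).
      targets = out_links.empty? ? pages : out_links
      share   = follow * rank[page] / targets.size
      targets.each { |target| nxt[target] += share }
    end
    rank = nxt
  end

  rank
end

pagerank_by_iteration(GRAPH).each do |page, score|
  puts format("%d %.4f", page, score)
end
```

Only the links that actually exist are visited, which is why the dictionary form scales better than the dense N x N matrix.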

Page 49: Building A Mini Google  High Performance Computing In Ruby

Computing PageRank
in one swoop

$$q = t\,(I - sG)^{-1}E = \begin{bmatrix} P_1 \\ \vdots \\ P_n \end{bmatrix}$$

I is the identity matrix.

Don’t trust me! Verify it yourself!
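One way to verify it, sketched under the assumption that G is arranged so that Gq moves probability one link forward and that E is the uniform teleportation vector: the PageRank vector q is the distribution that one more step of the surfer leaves unchanged.

```latex
\begin{aligned}
q &= s\,G\,q + t\,E \\
(I - sG)\,q &= t\,E \\
q &= t\,(I - sG)^{-1}E
\end{aligned}
```

Since s = 0.85 is less than 1 and the columns of G sum to at most 1, the matrix I - sG is invertible, so the last step is always allowed.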

Page 50: Building A Mini Google  High Performance Computing In Ruby

Enough hand-waving, dammit!
show me the code

Page 51: Building A Mini Google  High Performance Computing In Ruby

Birth of EM-Proxy
flash of the obvious

Hot, Fast, Awesome

Page 52: Building A Mini Google  High Performance Computing In Ruby

Hot, Fast, Awesome

http://rb-gsl.rubyforge.org/

Click there! … Give yourself a weekend.

Page 53: Building A Mini Google  High Performance Computing In Ruby

Click there! … Give yourself a weekend. http://ruby-gsl.sourceforge.net/

Page 54: Building A Mini Google  High Performance Computing In Ruby

PageRank in Ruby
6 lines, or less

require "gsl"
include GSL

# INPUT: link structure matrix (NxN)
# OUTPUT: pagerank scores
def pagerank(g)
  raise if g.size1 != g.size2

  i = Matrix.I(g.size1)                         # identity matrix
  p = (1.0/g.size1) * Matrix.ones(g.size1,1)    # teleportation vector

  s = 0.85 # probability of following a link
  t = 1-s  # probability of teleportation

  t*((i-s*g).invert)*p
end

Verify NxN

Page 55: Building A Mini Google  High Performance Computing In Ruby

Constants…

Page 56: Building A Mini Google  High Performance Computing In Ruby

PageRank!

Page 57: Building A Mini Google  High Performance Computing In Ruby

Ex: Circular Web
testing intuition…

pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
> [0.33, 0.33, 0.33]

[Figure: pages N, K, X, each with P = 0.33]

Page 58: Building A Mini Google  High Performance Computing In Ruby

Ex: All roads lead to K
testing intuition…

pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
> [0.05, 0.07, 0.87]

[Figure: pages N, K, X, labeled P = 0.05, P = 0.07, P = 0.87]

Page 59: Building A Mini Google  High Performance Computing In Ruby

PageRank + Ferret
awesome search, ftw!

Page 60: Building A Mini Google  High Performance Computing In Ruby

require 'ferret'
include Ferret

index = Index::Index.new()

index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
index << {:title => "2", :content => "what is it", :pr => 0.07 }
index << {:title => "3", :content => "it is a banana", :pr => 0.87 }

Store PageRank

[Figure: pages 1, 2, 3 with P = 0.05, P = 0.07, P = 0.87]

Page 61: Building A Mini Google  High Performance Computing In Ruby

index.search_each('content:"world"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
end

puts "*" * 50

sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:"world"', :sort => sf_pr) do |id, score|
  puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
end

# Score: 0.267119228839874, 3 (PR: 0.87)
# Score: 0.17807948589325, 1 (PR: 0.05)
# Score: 0.17807948589325, 2 (PR: 0.07)
# ***********************************
# Score: 0.267119228839874, 3, (PR: 0.87)
# Score: 0.17807948589325, 2, (PR: 0.07)
# Score: 0.17807948589325, 1, (PR: 0.05)

TF-IDF Search
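The second query on the slide sorts by PageRank alone. A different option, shown here only as a plain-Ruby sketch and not as anything Ferret does for you, is to keep the text score as the primary key and use PageRank to break ties. The hit data is copied from the output above; the hash layout is an assumption:

```ruby
# Hits as printed on the slide: Ferret score plus the stored PageRank.
hits = [
  { title: "3", score: 0.267119228839874, pr: 0.87 },
  { title: "1", score: 0.17807948589325,  pr: 0.05 },
  { title: "2", score: 0.17807948589325,  pr: 0.07 }
]

# Sort by text score first, then break ties with PageRank (both descending).
ranked = hits.sort_by { |hit| [-hit[:score], -hit[:pr]] }

ranked.each do |hit|
  puts "#{hit[:title]} (score #{hit[:score]}, PR #{hit[:pr]})"
end
```

With this data the two strategies happen to agree, because the page with the best text score also has the highest PageRank.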

Page 62: Building A Mini Google  High Performance Computing In Ruby

PageRank FTW!

Page 63: Building A Mini Google  High Performance Computing In Ruby

Google

Others

Page 64: Building A Mini Google  High Performance Computing In Ruby

Search*: Graphs are ubiquitous!
PageRank is a general purpose hammer

Page 65: Building A Mini Google  High Performance Computing In Ruby

PageRank + Social Graph
GitHub

Username        GitCred
==============================
37signals       10.00
imbriaco         9.76
why              8.74
rails            8.56
defunkt          8.17
technoweenie     7.83
jeresig          7.60
mojombo          7.51
yui              7.34
drnic            7.34
pjhyett          6.91
wycats           6.85
dhh              6.84

http://bit.ly/3YQPU

Page 66: Building A Mini Google  High Performance Computing In Ruby

PageRank + Social Graph
Twitter

Hmm…

Analyze the social graph:
- Filter messages by ‘TwitterRank’
- Suggest users by ‘TwitterRank’
- …

Page 67: Building A Mini Google  High Performance Computing In Ruby

PageRank + Product Graph
E-commerce

Link items purchased in same cart… Run PR on it.

Page 68: Building A Mini Google  High Performance Computing In Ruby

PageRank = Powerful Hammer
use it!

Page 69: Building A Mini Google  High Performance Computing In Ruby

Personalization
how would you do it?

Page 70: Building A Mini Google  High Performance Computing In Ruby

PageRank + Personalization
customize the teleportation vector

$$T = \begin{bmatrix} 0.15/N \\ \vdots \\ 0.15/N \end{bmatrix}$$

Teleportation distribution doesn’t have to be uniform!

yahoo.com is my homepage!
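To make that concrete, here is a sketch of power iteration in which the teleportation distribution is a parameter. The helper name `personalized_pagerank`, the three-site cycle, and the `example.org` / `example.net` hosts are all assumptions made for illustration:

```ruby
# Power iteration with a custom teleportation distribution.
# graph: page => pages it links to; teleport: page => probability (sums to 1).
def personalized_pagerank(graph, teleport, follow: 0.85, iterations: 100)
  pages = graph.keys
  rank  = pages.to_h { |page| [page, teleport.fetch(page, 0.0)] }

  iterations.times do
    nxt = pages.to_h { |page| [page, (1 - follow) * teleport.fetch(page, 0.0)] }
    graph.each do |page, out_links|
      if out_links.empty?
        # Dead end: redistribute according to the teleport vector (an assumption).
        pages.each { |target| nxt[target] += follow * rank[page] * teleport.fetch(target, 0.0) }
      else
        share = follow * rank[page] / out_links.size
        out_links.each { |target| nxt[target] += share }
      end
    end
    rank = nxt
  end

  rank
end

# Hypothetical three-site web arranged in a cycle.
graph = {
  "yahoo.com"   => ["example.org"],
  "example.org" => ["example.net"],
  "example.net" => ["yahoo.com"]
}

uniform  = graph.keys.to_h { |page| [page, 1.0 / graph.size] }
homepage = { "yahoo.com" => 1.0, "example.org" => 0.0, "example.net" => 0.0 }

p personalized_pagerank(graph, uniform)
p personalized_pagerank(graph, homepage)
```

With the uniform vector every page in the cycle scores the same. Teleporting only to the homepage tilts the ranking toward yahoo.com and the pages a few clicks away from it.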

Page 71: Building A Mini Google  High Performance Computing In Ruby

Gaming PageRank
for fun and profit (I don’t endorse it)

Make pages with links!

http://bit.ly/pagerank-spam

Page 72: Building A Mini Google  High Performance Computing In Ruby

Questions?

The slides… Twitter My blog

Slides: http://bit.ly/railsconf-pagerank

Ferret: http://bit.ly/ferret
RB-GSL: http://bit.ly/rb-gsl

PageRank on Wikipedia: http://bit.ly/wp-pagerank
Gaming PageRank: http://bit.ly/pagerank-spam

Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog