ranking ida mele. introduction the set of software components for the management of large sets of...

35
Ranking Ida Mele

Upload: janis-cross

Post on 01-Jan-2016

221 views

Category:

Documents


0 download

TRANSCRIPT

Ranking

Ida Mele

Introduction

• The set of software components for the management of large sets of data is made of:• MG4J• Fastutil• the DSI Utilities• Sux4J• WebGraph• the LAW software

• These software components have been developed by the DSI of the University of Milan

Ida Mele Ranking 2

Fastutil

• Fastutil 6 is a free software, developed in Java• Technical requirement: • Java >= 6

• Useful links:• http://fastutil.di.unimi.it/ • http://fastutil.di.unimi.it/docs/

Ida Mele Ranking 3

Fastutil

• Fastutil extends Java Collections, and it provides:• Type-specific maps, sets, and lists• Priority queues with a small memory footprint and

fast access and insertion• 64-bit arrays, sets, and lists• Fast I/O classes for text and binary files

Ida Mele Ranking 4

Fastutil: Advantages

• Advantages in using Fastutil:• Classes of Fastutil are implemented in order to

work on huge collections of data in an efficient way

• Fastutil provides a new set of classes to deal with collections whose size exceeds 231

Ida Mele Ranking 5

Fastutil: Advantages

• Advantages in using Fastutil:• There are additional features (e.g., bidirectional

iterators) that are not available in the standard classes

• Classes can be plugged into existing code, because they implement their standard counterpart (e.g., Map is used for maps)

Ida Mele Ranking 6

Fastutil: Big Arrays

• BigArrays: This class provides static methods and objects for working with big arrays

• Big arrays are arrays-of-arrays. For example, a big array of integers has type int[][]

• Methods handle these arrays-of-arrays as if they are monodimensional arrays with 64-bit indices

• The length of a big array is bounded by Long.MAX_VALUE rather than Integer.MAX_VALUE

Ida Mele Ranking 7

Fastutil: Big Arrays

• Given a big array a, a[0], a[1], … a[n] are called segments. Each one has length SEGMENT_SIZE (the last segment can have a smaller size)

• Each index i is associated with a segment and a displacement into the segment• Methods segment/displacement compute the

segment/displacement associated with a given index• Method index receives the segment and the displacement and

returns the corresponding index• Methods get/set allow to return/set the value of a given

element in the big array

Ida Mele Ranking 8

Fastutil Big Arrays – example

• We want to scan the big array a• First solution:

for( int s = 0; s < a.length; s++ ) {final int[] t = a[ s ];for( int d = 0; d < t.length; d++ ) {

//do something with t[ d ] }

}

Ida Mele Ranking 9

Fastutil Big Arrays – example

• Second solution:for( int s = a.length; s-- != 0; ) {

final int[] t = a[ s ];for( int d = t.length; d-- != 0; ) {

//do something with t[ d ] }

}

Ida Mele Ranking 10

Fastutil Big Arrays – example

• Third solution:for( int s = a.length; s-- != 0; ) {

final long[] t = a[ s ]; for( int d = t.length; d-- != 0; )

t[d] = index( s, d );}

• We can use the index method, which returns the index associated with a segment and displacement

Ida Mele Ranking 11

Fastutil: Big Data Structures

• Fastutil provides classes also for other data structures:• BigList: a list with indices. The instances of this

class implement the same semantics of traditional List

• HashBigSet: the instances of this class use a hash table to represent a big set. The number of elements in the set is limited only by the amount of core memory

Ida Mele Ranking 12

Dsiutils

• The DSI utilities are a mishmash of classes• Free software• Developed in Java• Useful links:• http://dsiutils.di.unimi.it/• http://dsiutils.di.unimi.it/docs/

Ida Mele Ranking 13

Dsiutils: MultipleString

• In large-scale text indexing we want to use a mutable string that, once frozen, can be used in the same optimized way of an immutable string

• In Java we have String and StringBuffer, which can be used for immutable and mutable strings respectively

• The solution is MultipleString• MultipleString does not need synchronization

Ida Mele Ranking 14

Dsiutils: Packages

• Some important packages:• it.unimi.dsi.bits contains main classes for

manipulating bits. For example, the class BitVectors provides static methods and objects that do useful things with bit vectors

• it.unimi.dsi.compression provides word-based compression/decompression classes

• it.unimi.dsi.util offers implementations of BloomFilters, PrefixMaps, StringMaps, BinaryTries and others

Ida Mele Ranking 15

WebGraph

• WebGraph is a framework for graph compression

• It exploits modern compression techniques to manage very large graphs

• Useful links:• http://webgraph.di.unimi.it/• http://webgraph.di.unimi.it/docs/

Ida Mele Ranking 16

WebGraph

• WebGraph provides:• ζ-codes, which are suitable for storing web graphs• Algorithm for compressing the graph that exploit

gap compression as well as ζ-codes. The parameters provide different tradeoffs between access speed and compression ratio

• Algorithms to access to compressed graphs without decompression. The lazy techniques delay the decompression until it is necessary

Ida Mele Ranking 17

WebGraph: Classes

• Some important classes:• ImmutableGraph is an abstract class representing

an immutable graph • BVGraph allows to store and access web graphs in

a compressed form• ASCIIGraph is used to store the graph in a human-

readable ASCII format

Ida Mele Ranking 18

WebGraph: Classes

• Some important classes: • ArcLabelledImmutableGraph is an abstract

implementation of a graph with labeled arcs • Transform returns the transformed version of an

immutable graph. We can use the transpose method of this class if we want to create the transpose graph

Ida Mele Ranking 19

LAW

• Java software developed by the Laboratory for Web Algorithms

• It is free and contains several implementations of the Pagerank algorithm

• Useful links:• http://law.di.unimi.it/software.php• http://law.di.unimi.it/software/docs/index.html

Ida Mele Ranking 20

LAW: PageRank

• PageRank of the package it.unimi.dsi.law.rank is an abstract class that defines methods and attributes for the PageRank algorithm

• Provided features:• we can set the preference vectors• we can set the damping factor• we can program stopping criteria• step-by-step execution• reusability

Ida Mele Ranking 21

Exercise

• Download the archive with libraries: lib.zip • Download the files:• set-classpath.sh • example• Text2ASCII.class and PrintRanks.class

available at: http://www.dis.uniroma1.it/~mele/WebIR.html• Set the classpath using the command:

source set-classpath.sh

Ida Mele Ranking 22

Build the Graph: Step1

• Step 1 - Create the file in the format ASCIIGraph using the command: java Text2ASCII example

• Output:example.graph-txt: the first line contains the number of

nodes (e.g., n). The following n lines contain the list of out-neighbours of the nodes. In particular, the line i-th contains the successors of the node i, sorted in an increasing order and separated by a space

Ida Mele Ranking 23

output: example.graph-txt

Ida Mele Ranking 24

Build the Graph: Step1

101 8 94 7 91 3 4 5 6 7 8 91 4 5 6 9

1 211 2 3 4 55 90 1 3 4 6

101 8 94 7 91 3 4 5 6 7 8 91 4 5 6 9

1 211 2 3 4 55 90 1 3 4 6

0 10 80 91 4 1 7 1 92 12 32 42 5…

0 10 80 91 4 1 7 1 92 12 32 42 5…

input: example

Text2ASCIIText2ASCII

more example.graph-txt

Ida Mele Ranking 25

Build the Graph: Step1

101 8 94 7 91 3 4 5 6 7 8 91 4 5 6 9

1 211 2 3 4 55 90 1 3 4 6

101 8 94 7 91 3 4 5 6 7 8 91 4 5 6 9

1 211 2 3 4 55 90 1 3 4 6

Num of nodes

Lists of successors

0123456789

0123456789

Node id

.

.

.

Build the Graph: Step2

• We can use the main method of the BVGraph class to load and compress an ImmutableGraph

• The compressed graph is described by:• basename.graph: the graph file. It contains the successor

lists, one for each node. Each list is a sequence of natural number that are coded as sequence of bits in a efficient way

• basename.offsets: the offset file. It stores the offset for each node of the graph

• basename.properties: the file with properties and statistics

Ida Mele Ranking 26

Build the Graph: Step2

• Step 2 - Conversion from the ASCIIGraph to the BVGraph:java it.unimi.dsi.big.webgraph.BVGraph -g ASCIIGraph example exampleBV

• Output:• exampleBV.graph• exampleBV.offsets• exampleBV.properties

Ida Mele Ranking 27

more exampleBV.properties

Ida Mele Ranking 28

…compratio=1,89bitsforblocks=22residualarcs=15version=0…nodes=10compressionflags=intervalisedarcs=10bitspernode=16,8arcs=34…

…compratio=1,89bitsforblocks=22residualarcs=15version=0…nodes=10compressionflags=intervalisedarcs=10bitspernode=16,8arcs=34…

Build the Graph: Step2

Compute PageRank

• To compute the PageRank we can use the following implementations:• PowerMethod• PageRankPowerSeries• GaussSeidel• Jacobi

• The output is made of 2 files:• basename.ranks: binary file with the results of

computation• basename.properties: text files with general info

Ida Mele Ranking 29

Compute PageRank: Step 1

• Step 1 - We use the main method of the class PageRankPowerMethod by issuing the following command:java it.unimi.dsi.law.rank.PageRankPowerMethod exampleBV examplePR

• Output:• examplePR.ranks• examplePR.properties

Ida Mele Ranking 30

more examplePR.properties

Ida Mele Ranking 31

rank.alpha = 0.85rank.stronglyPreferential = falsemethod.numberOfIterations = 12method.norm.type = INFTYmethod.norm.value = 8.396275630317973E-7graph.nodes = 10graph.fileName = example

rank.alpha = 0.85rank.stronglyPreferential = falsemethod.numberOfIterations = 12method.norm.type = INFTYmethod.norm.value = 8.396275630317973E-7graph.nodes = 10graph.fileName = example

Compute PageRank: Step 1

Compute PageRank: Step 2

• Step 2 – Print the scores• The file .ranks is a binary file with the scores of

the nodes, so we can print PageRank scores by using the class PrintRanks:java PrintRanks examplePR.ranks > ranks

• Output:ranks: file with n lines, one for each node. The i-th

line contains the score of node number i

Ida Mele Ranking 32

Compute PageRank: Step 2

more ranks

Ida Mele Ranking 33

0.05156599403615980.201978506316694950.079826578179069640.075877858304762110.146004576836513080.086085011918961270.072946886114660640.09311949208285820.050502411521725270.14209268468859523

0.05156599403615980.201978506316694950.079826578179069640.075877858304762110.146004576836513080.086085011918961270.072946886114660640.09311949208285820.050502411521725270.14209268468859523

PageRank values

0123456789

0123456789

Node id

.

.

.

Compute PageRank: sorting

sort –r ranks

Ida Mele Ranking 34

0.201978506316694950.146004576836513080.142092684688595230.09311949208285820.086085011918961270.079826578179069640.075877858304762110.072946886114660640.05156599403615980.05050241152172527

0.201978506316694950.146004576836513080.142092684688595230.09311949208285820.086085011918961270.079826578179069640.075877858304762110.072946886114660640.05156599403615980.05050241152172527

PageRank values sorted in decreasing order

Homework

1. Repeat the exercise with the graphs:• WikiIT• WikiPTavailable at: http://www.dis.uniroma1.it/~mele/WebIR.html

2. Create a new graph by using synthetic or real data, and repeat the exercise

Ida Mele Ranking 35