
Large Scale Data Processing with DryadLINQ

Dennis Fetterly
Microsoft Research, Silicon Valley

Workshop on Data-Intensive Scientific Computing Using DryadLINQ

Outline

• Brief introduction to TidyFS
• Preparing/loading data onto a cluster
• Desirable properties in a Dryad cluster
• Detailed description of several IR algorithms

TidyFS goals

• A simple distributed filesystem that provides the abstractions necessary for data-parallel computations
• High performance, reliable, scalable service
• Workload
  – High throughput, sequential IO, write once
  – Cluster machines working in parallel
  – Terasort
    • 240 machines reading at 240 MB/s = 56 GB/s
    • 240 machines writing at 160 MB/s = 37 GB/s

TidyFS Names

• Stream: a sequence of partitions
  – e.g. tidyfs://dryadlinqusers/fetterly/clueweb09-English
  – Can have leases, for temp files or cleanup from crashes
• Partition:
  – Immutable
  – 64-bit identifier
  – Can be a member of multiple streams
  – Stored as an NTFS file on cluster machines
  – Multiple replicas of each partition can be stored

[Figure: Stream-1 as the ordered sequence of partitions Part 1, Part 2, Part 3, Part 4]

Preparation of Data

• Often substantially harder than it appears
• Issues:
  – Data format
  – Distribution of data
  – Network bandwidth
• Generating synthetic datasets is sometimes useful

Data Prep – Format

• Text records are simplest
  – Caveat: information that is not in the line itself
    • e.g. if a line number encodes information (see the sketch below)
• Binary records often require custom code to load to the cluster
  – Serialization/deserialization code generated by DryadLINQ uses C# Reflection
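Positional information is easy to lose once a file is split into partitions and processed in parallel. A minimal sketch of one workaround, making the line number an explicit field before loading; this is standard C#, and the paths are placeholders:

using System.IO;
using System.Linq;

// Prefix each line with its line number so the information
// survives partitioning and parallel processing.
var numbered = File.ReadLines(@"C:\data\input.txt")
                   .Select((line, i) => i + "\t" + line);
File.WriteAllLines(@"C:\data\input-numbered.txt", numbered);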

Custom Deserialization Code

public class UrlDocIdScoreQuery
{
    public string queryId;
    public string url;
    public string docId;
    public string queryString;
    public double score;

    public static UrlDocIdScoreQuery Read(DryadBinaryReader reader)
    {
        UrlDocIdScoreQuery rec = new UrlDocIdScoreQuery();
        // fields are read in on-disk record order, not declaration order
        rec.queryId = ReadAnyString(reader);
        rec.queryString = ReadAnyString(reader);
        rec.url = ReadAnyString(reader);
        rec.docId = ReadAnyString(reader);
        rec.score = reader.ReadDouble();
        return rec;
    }

    public static string ReadAnyString(DryadBinaryReader dbr) {…}
}
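The body of ReadAnyString is elided on the slide. For illustration only, here is the general shape of a length-prefixed string reader, written against the standard System.IO.BinaryReader rather than Dryad's reader; the 4-byte length prefix and UTF-8 encoding are assumptions, not the toolkit's actual wire format:

using System.IO;
using System.Text;

// Hypothetical sketch: read a length-prefixed UTF-8 string.
public static string ReadLengthPrefixedString(BinaryReader br)
{
    int len = br.ReadInt32();          // 4-byte length prefix (assumed)
    byte[] bytes = br.ReadBytes(len);  // raw payload
    return Encoding.UTF8.GetString(bytes);
}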

Data Prep - Loading

• DryadLINQ job
  – Often needs a dummy input anchor
• Custom program
  – Write records to TidyFS partitions
• “SneakerNet” often a good option

Data Loading - DryadLINQ

• Need input “anchor” to run on cluster
  – Generate or use existing stream
• Sample:

IEnumerable<Entry> GenerateEntries(Random x, int numItems)
{
    for (int i = 0; i < numItems; i++) {
        // code to generate records
        yield return record;
    }
}
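The record-generation logic is elided above. A self-contained version, with a hypothetical Entry type and generation logic invented purely for illustration:

using System;
using System.Collections.Generic;

// Hypothetical record type, not part of the toolkit.
public struct Entry
{
    public long key;
    public double value;
    public Entry(long k, double v) { key = k; value = v; }
}

public static IEnumerable<Entry> GenerateEntries(Random x, int numItems)
{
    for (int i = 0; i < numItems; i++)
    {
        // Each record is derived from the Random passed in, so every
        // partition (seeded differently) generates an independent sequence.
        yield return new Entry(x.Next(), x.NextDouble());
    }
}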


DryadLINQ Job

var streamname = "tidyfs://datasets/anchor";
var os = @"tidyfs://msri/teamname/data?compression=" + CompressionScheme.GZipFast;
var r = PartitionedTable.Get<int>(streamname)
        .Take(1)                                          // one record from the anchor stream
        .SelectMany(x => Enumerable.Range(0, partitions)) // fan out: one value per partition
        .HashPartition(x => x, partitions)                // spread the values across partitions
        .Select(x => new Random(x))                       // deterministic per-partition seed
        .SelectMany(x => GenerateEntries(x, numItems))    // generate records in parallel
        .ToPartitionedTable(os);                          // write compressed output to TidyFS

Data Loading - Databases

• Bulk copy into files
  – Use queries to produce multiple files
• Perform queries within a DryadLINQ UDF:

IEnumerable<Entry> PerformQuery(string queryArg)
{
    var results = “select * from …”;  // slide shorthand: run the query, iterate the results
    foreach (var record in results) {
        yield return record;
    }
}
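A fleshed-out sketch of the same pattern using standard ADO.NET; the connection string, table, and columns are placeholder assumptions, and Entry is the hypothetical record type from the earlier sketch:

using System.Collections.Generic;
using System.Data.SqlClient;

public static IEnumerable<Entry> PerformQuery(string queryArg)
{
    using (var conn = new SqlConnection("Server=dbhost;Database=mydb;Integrated Security=true"))
    {
        conn.Open();
        using (var cmd = new SqlCommand("select k, v from records where grp = @g", conn))
        {
            cmd.Parameters.AddWithValue("@g", queryArg);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // The iterator is lazy, so rows stream through the
                    // DryadLINQ query without buffering the whole table.
                    yield return new Entry(reader.GetInt64(0), reader.GetDouble(1));
                }
            }
        }
    }
}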

Building a cluster

• Overall goal: a high-throughput system
  – Not latency sensitive
• More, slower computers are often better than fewer, faster computers
• Multiple cores better than high clock frequency
• Multiple disks to increase throughput
• Sufficient RAM

Networking a Cluster

• Network topology for medium to large clusters
  – Attempt to maximize cross-rack bandwidth
  – Two-tier topology
    • Rack switches and core switches
• Port aggregation
  – Bond multiple connections together
    • 1 GbE or 10 GbE

Cluster Software

• Runs on Windows HPC Server 2008
• Academic Release
  – For non-commercial use
• Commercial License

DryadLINQ IR Toolkit

• Library that uses DryadLINQ
• Source code for a number of IR algorithms
  – Text retrieval: BM25/BM25F
  – Link-based ranking: PageRank/SALSA-SETR
  – Text processing: shingle-based duplicate detection
• Designed to work well with the ClueWeb09 collection
  – Including preprocessing the data to load the cluster
• Available from http://research.microsoft.com/dryadlinqir/

ClueWeb09 Collection

• Collected/distributed by CMU
• 1 billion web pages crawled in Jan/Feb 2009
• 10 different languages
  – en, zh, es, ja, de, fr, ko, it, pt, ar
• 5 TB compressed; 25 TB uncompressed
• Available to the research community
• Dataset available for your projects
  – Web graph, 503m English web pages

Example: Term Frequencies

Count term frequencies in a set of documents:

var docs   = new PartitionedTable<Doc>("tidyfs://dennis/docs");
var words  = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToPartitionedTable("tidyfs://dennis/counts.txt");
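Because this is ordinary LINQ, essentially the same operator chain can be run locally over in-memory collections, which is a convenient way to debug before moving to the cluster. A self-contained sketch; Doc and WordCount are stand-ins invented here:

using System.Collections.Generic;
using System.Linq;

public class Doc { public string[] words; }

public class WordCount
{
    public string Word;
    public int Count;
    public WordCount(string w, int c) { Word = w; Count = c; }
}

public static class TermFreqLocal
{
    public static List<WordCount> Count(IEnumerable<Doc> docs)
    {
        // Identical operator chain, executed by LINQ-to-objects.
        return docs.SelectMany(doc => doc.words)
                   .GroupBy(word => word)
                   .Select(g => new WordCount(g.Key, g.Count()))
                   .ToList();
    }
}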

[Figure: LINQ operator graph for the query above: IN (metadata) → SM (doc => doc.words) → GB (word => word) → S (g => new …) → OUT (metadata)]

Distributed Execution of Term Freq

[Figure: DryadLINQ compiles the LINQ expression (IN → SM → GB → S → OUT) into a Dryad execution graph that runs the same operators across the cluster]

Execution Plan for Term Frequency

[Figure: single-partition plan. The logical SM → GB → S pipeline is compiled into two pipelined stages: SelectMany → Sort → GroupBy → Count → Distribute, then Mergesort → GroupBy → Sum]

Execution Plan for Term Frequency

[Figure: the same plan instantiated in parallel: each input partition runs its own SelectMany → Sort → GroupBy → Count → Distribute pipeline, and the redistributed data is combined by parallel Mergesort → GroupBy → Sum pipelines]

BM25 “Grep”

• For batch evaluation of queries, calculating BM25 is just a select operation

string queryTermDocFreqURLLocal = @"E:\TREC\query-doc-freqs.txt";
Dictionary<string, int> dfs = GetDocFreqs(queryTermDocFreqURLLocal);
PartitionedTable<InitialWordRecord> initialWords =
    PartitionedTable.Get<InitialWordRecord>(initialWordsURL);

var BM25s = from doc in initialWords
            select ComputeDocBM25(queries, doc, dfs);

BM25s.ToPartitionedTable("tidyfs://dennis/scoredDocs");
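For reference (the slides do not spell it out), the standard Okapi BM25 score that a ComputeDocBM25-style function evaluates for a document D and query Q is

\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, avgdl is the average document length, and k_1 and b are free parameters (commonly k_1 ≈ 1.2, b ≈ 0.75). The dfs dictionary above supplies the per-term document frequencies from which IDF is computed.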

PageRank

Ranks web pages by propagating scores along the hyperlink structure.

Each iteration as an SQL query:

1. Join edges with ranks
2. Distribute rank on edges
3. GroupBy edge destination
4. Aggregate into ranks
5. Repeat

One PageRank Step in DryadLINQ

// one step of pagerank: dispersing and re-accumulating rank
public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}

A Complete DryadLINQ Program

var pages = DryadLinq.GetTable<Page>("tidyfs://pages.txt");

// repeat the iterative computation several times
var ranks = pages.Select(page => new Rank(page.name, 1.0));
for (int iter = 0; iter < iterations; iter++)
{
    ranks = PRStep(pages, ranks);
}
ranks.ToDryadTable<Rank>("outputranks.txt");

public struct Page
{
    public UInt64 name;
    public Int64 degree;
    public UInt64[] links;

    public Page(UInt64 n, Int64 d, UInt64[] l)
    {
        name = n; degree = d; links = l;
    }

    public Rank[] Disperse(Rank rank)
    {
        // split this page's rank evenly across its out-links
        Rank[] ranks = new Rank[links.Length];
        double score = rank.rank / this.degree;
        for (int i = 0; i < ranks.Length; i++)
        {
            ranks[i] = new Rank(this.links[i], score);
        }
        return ranks;
    }
}

public struct Rank
{
    public UInt64 name;
    public double rank;

    public Rank(UInt64 n, double r) { name = n; rank = r; }
}

public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}
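In equation form, the update PRStep implements is the plain propagation step

r_{t+1}(v) = \sum_{u \to v} \frac{r_t(u)}{\deg(u)}

where deg(u) is the out-degree of page u. Note this simplified version omits the damping factor; with damping d, the usual update is r_{t+1}(v) = \frac{1-d}{N} + d \sum_{u \to v} \frac{r_t(u)}{\deg(u)}.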

PageRank Optimizations

• Benchmark: PageRank on a 954m-page graph
• Naïve approach: 10 iterations, ~3.5 hours, 1.2 TB
• Apply several optimizations
  – Change data distribution
  – Pre-group pages by host
  – Rename host groups with dense names
  – Cull out leaf nodes
  – Pre-aggregate ranks for each host
• Final version: 10 iterations, 11.5 min, 116 GB

Tactics for Improving Performance

• Loop unrolling (see the sketch below)
• Reduce data movement
  – Improve data locality
• Choose what to Group
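A hedged sketch of what loop unrolling can mean for the PageRank loop above: materialize every k iterations instead of after every single one, so DryadLINQ compiles a multi-iteration plan between barriers. The checkpoint stream name, the constant k, and the assumption that ToPartitionedTable returns a queryable table are all illustrative, not documented toolkit behavior:

// Unroll k PageRank iterations per materialization (illustrative sketch).
const int k = 4;                                  // unrolling factor (assumed)
for (int iter = 0; iter < iterations; iter += k)
{
    for (int j = 0; j < k && iter + j < iterations; j++)
    {
        ranks = PRStep(pages, ranks);             // composes lazily into one larger plan
    }
    // Materializing bounds plan size and limits recomputation on failure.
    ranks = ranks.ToPartitionedTable("tidyfs://dennis/ranks-checkpoint");
}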

Gotchas

• Non-deterministic output (example below)
  – e.g. an RNG in a user-defined function
• Writing to shared state
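A minimal sketch of the RNG hazard; the UDF below is hypothetical. If Dryad re-executes a vertex after a failure, a clock-seeded Random produces different values on the retry, so downstream vertices can observe inconsistent data:

// Hypothetical UDF with non-deterministic output.
public static double NoisyScore(double score, long id)
{
    var rng = new Random();            // seeded from the clock: differs per execution
    return score + rng.NextDouble();   // a re-executed vertex yields different output
}

// Deterministic alternative: derive the seed from the input record.
public static double NoisyScoreDeterministic(double score, long id)
{
    var rng = new Random(id.GetHashCode());
    return score + rng.NextDouble();
}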

Schedule for Today

• 9:30–10:00: Meet with team, finalize project
• 10:30–12:00: Work on projects, discuss approach with a speaker

Backup Slides

Cluster Configuration

[Figure: cluster configuration: a head node, TidyFS servers, and cluster machines that both run tasks and host the TidyFS storage service]

How a Dryad job reads from TidyFS

[Figure: read protocol. The job manager asks the TidyFS service to list the partitions in the stream (Part 1 on Machine 1, Part 2 on Machine 2) and schedules a vertex on each of those machines. Each vertex calls GetReadPath for its partition and receives the local path of the partition file, e.g. D:\tidyfs\0001.data, D:\tidyfs\0002.data, …]

How a Dryad job writes to TidyFS

[Figure: write protocol, in two phases. First, the job manager creates per-vertex streams Str1_v1 and Str1_v2, each holding one pending partition (Part 1, Part 2), and schedules Vertex 1 on Machine 1 and Vertex 2 on Machine 2. Second, each machine calls GetWritePath and writes its partition to the returned local path (D:\tidyfs\0001.data, D:\tidyfs\0002.data). As the vertices complete, AddPartitionInfo(Part 1, Machine 1, Size, Fingerprint, …) and AddPartitionInfo(Part 2, Machine 2, Size, Fingerprint, …) record the partition metadata; the job manager then creates Str1, calls ConcatenateStreams(str1, str1_v1, str1_v2), and deletes the temporary streams str1_v1 and str1_v2]