a new partnership for cross-scale, cross-domain escience

64
A New Partnership for eScience Bill Howe, UW Ed Lazowska, UW Garth Gibson, CMU Christos Faloutsos, CMU Peter Lee, CMU (DARPA) Chris Mentzel, Moore QuickTime™ and a decompressor are needed to see this pictu

Upload: bill-howe

Post on 10-May-2015

573 views

Category:

Technology


0 download

DESCRIPTION

Overview of the Moore Foundation-funded collaborative project between CMU and UW to advance eScience given at the Microsoft eScience 2009 workshop

TRANSCRIPT

Page 1: A New Partnership for Cross-Scale, Cross-Domain eScience

A New Partnership for eScience

Bill Howe, UW

Ed Lazowska, UW

Garth Gibson, CMU

Christos Faloutsos, CMU

Peter Lee, CMU (DARPA)

Chris Mentzel, Moore

QuickTime™ and a decompressor

are needed to see this picture.

Page 2: A New Partnership for Cross-Scale, Cross-Domain eScience
Page 3: A New Partnership for Cross-Scale, Cross-Domain eScience

http://escience.washington.edu

Page 4: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 4

Page 5: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 5

\

Page 6: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 6

The University of Washington eScience Institute

Rationale The exponential increase in sensors is transitioning all fields of science and

engineering from data-poor to data-rich Techniques and technologies include

Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing

If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive

Mission Help position the University of Washington at the forefront of research both in

modern eScience techniques and technologies, and in the fields that depend upon them

Strategy Bootstrap a cadre of Research Scientists Add faculty in key fields Build out a “consultancy” of students and non-research staff

QuickTime™ and a decompressor

are needed to see this picture.

Page 7: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 7

Staff and Funding

Funding $1M/year direct appropriation from WA State Legislature $1.5M from Gordon and Betty Moore Foundation (joint with CMU) Multiple proposals outstanding

Staffing Dave Beck, Research Scientist: Biosciences and software eng. Jeff Gardner, Research Scientist: Astrophysics and HPC Bill Howe,Research Scientist: Databases, visualization, DISC Ed Lazowska, Director Erik Lundberg (50%), Operations Director Mette Peters, Health Sciences Liaison Chance Reschke, Research Engineer: large scale computing platforms

…plus a senior faculty search underway …plus a “consultancy” of students and professional staff

Page 8: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 8

All science is reducing to a database problem

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis) New model: “Download the world” (Data acquired en masse, in support of many hypotheses)

Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS) Medicine: ubiquitous digital records, MRI, ultrasound Oceanography: high-resolution models, cheap sensors, satellites Biology: lab automation, high-throughput sequencing

“Increase data collection exponentially with FlowCam!”

Page 9: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 9

The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB) clusters become clouds (PB)

The Long Tailda

ta v

olum

e

rank

Researchers with growing data management challenges but limited resources for cyberinfrastructure

• No dedicated IT staff

• Over-reliance on inadequate but familiar toolsCERN (~15PB/year)

LSST (~100PB)

PanSTARRS (~40PB)

Ocean Modelers <Spreadsheet

users>

SDSS (~100TB)

Seis-mologists

MicrobiologistsCARMEN (~50TB)

“The future is already here. It’s just not very evenly distributed.” -- William Gibson

Page 10: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 10

Case Study: Armbrust Lab

Page 11: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 11

Armbrust Lab Tech Roadmap

ClustalW

scala

bili

ty

cluster/cloud

workstation/server

MAQsp

ecif

ic ta

sks

gene

ral t

asks

Excel

NCBI BLAST

Phred/Phrap

CloudBurst

CLC Genomics Machine

Hadoop/Dryad

Parallel Databases

?

Azure, AWS

WebBlast*

RDBMS

R

PPlacer*

AnnoJBioPython

Past

Present

Soon

Other tools

specialization

Page 12: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 12

What Does Scalable Mean?

Operationally: In the past: “Works even if data doesn’t fit in main memory” Now: “Can make use of 1000s of cheap computers”

Formally: In the past: polynomial time and space. If you have N data

items, you must do no more than Nk operations Soon: logarithmic time and linear space. If you have N data

items, you must do no more than N log(N) operations

Soon, you’ll only get one pass at the data So you better make that one pass count

Page 13: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 13

A Goal: Cross-Scale Solutions

Gracefully scale up from files to databases to cluster to cloud from MB to GB to TB to PB

“Gracefully” means: logical data independence no expensive ETL migration projects

“Gracefully” means: everyone can use it Hackers / Computational Scientists Lab/Field Scientists The Public K12 Legislators

Page 14: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 14

Data Model Operations Services

GPL * * None for free

Workflow * arbitrary boxes-and-arrows

typing, provenance, Pegasus-style resource mapping, task parallelism

SQL / Relational Algebra

Relations Select, Project, Join, Aggregate, …

optimization, physical data independence, indexing, parallelism

MapReduce [(key,value)] Map, Reduce massive data parallelism, fault tolerance, scheduling

Pig Nested Relations

RA-like, with Nest/Flatten

optimization, monitoring, scheduling

DryadLINQ IQueryable, IEnumerable

RA + Apply + Partitioning

typing, massive data parallelism, fault tolerance

MPI Arrays/ Matrices

70+ ops data parallelism, full control

Page 15: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 15

MapReduce

Many tasks process big data, produce big data Want to use hundreds or thousands of CPUs

... but this needs to be easy Parallel databases exist, but require DBAs and $$$$ …and do not easily scale to thousands of computers

MapReduce is a lightweight framework, providing: Automatic parallelization and distribution Fault-tolerance I/O scheduling Status and monitoring

Page 16: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 16

public class LogEntry { public string user, ip; public string page; public LogEntry(string line) { string[] fields = line.Split(' '); this.user = fields[8]; this.ip = fields[9]; this.page = fields[5]; }}

public class UserPageCount{ public string user, page; public int count; public UserPageCount( string usr, string page, int cnt){ this.user = usr; this.page = page; this.count = cnt; }}

PartitionedTable<string> logs = PartitionedTable.Get<string>(@”file:\\…\logfile.pt”);var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\ulfar") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("ulfar", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access; htmAccesses.ToPartitionedTable(@”file:\\…\results.pt”);

slide source: Christophe Poulain, MSR

A complete DryadLINQ program

Page 17: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 17

Relational DatabasesPre-relational DBMS brittleness: if your data changed, your application often broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

physical data independence

logical data independence

files and pointers

relations

views

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Page 18: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 18

Relational Databases

Databases are especially, but exclusively, effective at “Needle in Haystack” problems:

Extracting small results from big datasets Transparently provide “old style” scalability Your query will always* finish, regardless of dataset size.

Indexes are easily built and automatically used when appropriateCREATE INDEX seq_idx ON sequence(seq);

SELECT seq FROM sequence WHERE seq = ‘GATTACGATATTA’;

*almost

Page 19: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 19

Key Idea: Data Independence

physical data independence

logical data independence

files and pointers

relations

views

SELECT * FROM my_sequences

SELECT seq FROM ncbi_sequences WHERE seq = ‘GATTACGATATTA’;

f = fopen(‘table_file’);fseek(10030440);while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .

Page 20: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 20

Key Idea: An Algebra of Tables

select

project

join join

Other operators: aggregate, union, difference, cross product

Page 21: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 21

Key Idea: Algebraic Optimization

N = ((z*2)+((z*3)+0))/1

Algebraic Laws: 1. (+) identity: x+0 = x2. (/) identity: x/1 = x3. (*) distributes: (n*x+n*y) = n*(x+y)4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2:N = (2+3)*z

two operations instead of five, no division operator

Same idea works with the Relational Algebra!

Page 22: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 22

Shared Nothing Parallel Databases

Teradata Greenplum Netezza Aster Data Systems DataAllegro Vertica MonetDB

Microsoft

Recently commercialized as “Vectorwise”

Page 23: A New Partnership for Cross-Scale, Cross-Domain eScience

Case Study: Astrophysics Simulation

Page 24: A New Partnership for Cross-Scale, Cross-Domain eScience

24

N-body Astrophysics Simulation

• 15 years in dev

• 109 particles

• Gravity

• Months to run

• 7.5 million CPU hours

• 500 timesteps

• Big Bang to now

Simulations from Tom Quinn’s Lab, work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska

Page 25: A New Partnership for Cross-Scale, Cross-Domain eScience

25

Q1: Find Hot Gas

SELECT id

FROM gas

WHERE temp > 150000

Page 26: A New Partnership for Cross-Scale, Cross-Domain eScience

26

Single Node: Query 1

169 MB 1.4 GB 36 GB

Page 27: A New Partnership for Cross-Scale, Cross-Domain eScience

27

Multiple Nodes: Query 1

Database Z

Page 28: A New Partnership for Cross-Scale, Cross-Domain eScience

28

Multiple Nodes:Query 2

Database Z

Page 29: A New Partnership for Cross-Scale, Cross-Domain eScience

29

Q4: Gas Deletion

SELECT gas1.id

FROM gas1

FULL OUTER JOIN gas2

ON gas1.id=gas2.id

WHERE gas2.id=NULL

Particles removed between two timesteps

Page 30: A New Partnership for Cross-Scale, Cross-Domain eScience

30

Single Node: Query 4

Page 31: A New Partnership for Cross-Scale, Cross-Domain eScience

31

Multiple Nodes: Query 4

Page 32: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 32

Ease of Use

star43 = FOREACH rawGas43 GENERATE $0 AS pid:long; star60 = FOREACH rawGas60 GENERATE $0 AS pid:long; groupedGas = COGROUP star43 BY pid, star60 BY pid;

selectedGas = FOREACH groupedGas GENERATE FLATTEN((IsEmpty(gas43) ? null : gas43)) as s43, FLATTEN((IsEmpty(gas60) ? null : gas60)) as s60;

destroyed = FILTER selectedGas BY s60 is null;

Page 33: A New Partnership for Cross-Scale, Cross-Domain eScience

Visualization and Mashups

Dancing with Data

Page 34: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 34

Data explosion, again

Data growth is outpacing Moore’s Law Why? Cost of acquisition has dropped through the floor Every pairwise comparison of datasets

generates a new dataset -- N2 growth

So: Scalable analysis is necessary But: Scalable analysis is hard

Page 35: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 35

It’s not just the size….

Corollary: # of apps scales as N2

Every pairwise comparison motivates a new application

To keep up, we need to entrain new programmers, make existing programmers more productive, or both

Page 36: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 36

Satellite Images + Crime Incidence Reports

Page 37: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 37

Twitter Feed + Flickr Stream

Page 38: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 38

Zooplankton and Temperature

<Vis movie>

QuickTime™ and a decompressor

are needed to see this picture.

Page 39: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 39

Why Visualization?

High bandwidth of the human visual cortex Query-writing presumes a precise goal Try this in SQL: “What does the salt wedge look like?”

Page 40: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 40

Data Product Ensembles

source: Antonio Baptista, Center for Coastal Margin Observation and Prediction

Page 41: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 41

Example: Find matching sequences

Given a set of sequences Find all sequences equal to

“GATTACGATATTA”

Page 42: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 42

Example System: Teradata

AMP = unit of parallelism

Page 43: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 43

Example System: Teradata

SELECT * FROM Orders o, Lines i WHERE o.item = i.item AND o.date = today()

join

select

scan scan

date = today()

o.item = i.item

Order oItem i

Find all orders from today, along with the items ordered

Page 44: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 44

Example System: Teradata

AMP 1 AMP 2 AMP 3

selectdate=today()

selectdate=today()

selectdate=today()

scanOrder o

scanOrder o

scanOrder o

hashh(item)

hashh(item)

hashh(item)

AMP 1 AMP 2 AMP 3

Page 45: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 45

Example System: Teradata

AMP 1 AMP 2 AMP 3

scanItem i

AMP 1 AMP 2 AMP 3

hashh(item)

scanItem i

hashh(item)

scanItem i

hashh(item)

Page 46: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 46

Example System: Teradata

AMP 1 AMP 2 AMP 3

join join joino.item = i.item o.item = i.item o.item = i.item

contains all orders and all lines where hash(item) = 1

contains all orders and all lines where hash(item) = 2

contains all orders and all lines where hash(item) = 3

Page 47: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 47

MapReduce Programming Model

Input & Output: each a set of key/value pairs Programmer specifies two functions:

Processes input key/value pair Produces set of intermediate pairs

Combines all intermediate values for a particular key Produces a set of merged output values (usually just

one)

map (in_key, in_value) -> list(out_key, intermediate_value)

reduce (out_key, list(intermediate_value)) -> list(out_value)

Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

slide source: Google, Inc.

Page 48: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 48

Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

Example: Document Processing

Page 49: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 49

Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

Example: Word length histogram

How many “big”, “medium”, and “small” words are used?

Page 50: A New Partnership for Cross-Scale, Cross-Domain eScience

Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

Big = Yellow = 10+ letters

Medium = Red = 5..9 letters

Small = Blue = 2..4 letters

Tiny = Pink = 1 letter

Example: Word length histogram

Page 51: A New Partnership for Cross-Scale, Cross-Domain eScience

Abridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of nature'sgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying it's foundation on such principles and organizing it's power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed will

dictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.

Example: Word length histogram

Split the document into chunks and process each chunk on a different computer

Chunk 1

Chunk 2

Page 52: A New Partnership for Cross-Scale, Cross-Domain eScience

(yellow, 20)(red, 71)(blue, 93)(pink, 6 )

Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

Map Task 1(204 words)

Map Task 2(190 words)

(key, value)

(yellow, 17)(red, 77)(blue, 107)(pink, 3)

Example: Word length histogram

Page 53: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 53

(yellow, 17)(red, 77)(blue, 107)(pink, 3)

(yellow, 20)(red, 71)(blue, 93)(pink, 6 )

Reduce tasks

(yellow, 17)(yellow, 20)

(red, 77)(red, 71)

(blue, 93)(blue, 107)

(pink, 6)(pink, 3)

Example: Word length histogram

A Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of nature'sgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying it's foundation on such principles and organizing it's power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed will

dictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.

Map task 1

Map task 2

“Shuffle step”

(yellow, 37)

(red, 148)

(blue, 200)

(pink, 9)

Page 54: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 54

New Example: What does this do?

map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values): // output_key: word // output_values: ???? int result = 0; for each v in intermediate_values: result += v; Emit(result);

slide source: Google, Inc.

Page 55: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 55

Before RDBMS: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd 1979

Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Relational Database Management Systems (RDBMS)

Page 56: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 56

MapReduce is a Nascent Database Engine

Access Methods and Scheduling:

Query Language:

Query Optimizer:

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Pig Latin

Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90

Page 57: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 57

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

MapReduce and Hadoop

MR introduced by Google Published paper in OSDI 2004

MR: high-level programming model and implementation for large-scale parallel data processing

Hadoop Open source MR implementation Yahoo!, Facebook, New York Times

Page 58: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 58

operators: • LOAD• STORE • FILTER• FOREACH … GENERATE • GROUP

binary operators: • JOIN• COGROUP• UNION

other support:• UDFs• COUNT• SUM • AVG• MIN/MAX

Additional operators:http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm

A Query Language for MR: Pig Latin

High-level, SQL-like dataflow language for MR Goal: Sweet spot between SQL and MR

Applies SQL-like, high-level language constructs to accomplish low-level MR programming.

Page 59: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 59

New Task: k-mer Similarity

Given a set of database sequences and a set of query sequences

Return the top N similar pairs, where similarity is defined as the number of common k-mers

Page 60: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 60

Pig Latin program

D = LOAD ’db_sequences.fasta' USING FASTA() AS (did,dsequence);

Q = LOAD ’query_sequences.fasta' USING FASTA() AS (qid,qsequence);

Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence));Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence));

R = JOIN Kd BY kmer, Kq BY kmer

G = GROUP R BY (qid, did);C = FOREACH G GENERATE qid, did, COUNT(kmer) as scoreT = FILTER C BY score > 4

STORE g INTO seqs.txt';

Page 61: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 61

New Task: Alignment

RMAP alignment implemented in HadoopMichael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009

Goal: Align reads to a reference genome

Overview: Map: Split reads and reference into k-mers Reduce: for matching k-mers, find end-to-end

alignments (seed and extend)

Page 62: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 62

MapReduce Overhead

QuickTime™ and a decompressor

are needed to see this picture.

Page 63: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 63

Elastic MapReduce

Custom Jar Java

Streaming Any language that can read/write stdin/stdout

Pig Simple data flow language

Hive SQL

Page 64: A New Partnership for Cross-Scale, Cross-Domain eScience

3/12/09 Bill Howe, eScience Institute 64

Taxonomy of Parallel Architectures

Easiest to program, but $$$$

Scales to 1000s of computers