MapReduce & Cloud
PengBo, Dec 6, 2010
MapReduce
Imperative Programming
In computer science, imperative programming is a programming paradigm that describes computation in terms of statements that change a program state.
Declarative Programming
In computer science, declarative programming is a programming paradigm that expresses the logic of a computation without describing its control flow.
Functional Language
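map f lst: ('a->'b) -> ('a list) -> ('b list)
Applies f to every element of the input list and outputs a new list.
fold f x0 lst: ('a*'b->'b) -> 'b -> ('a list) -> 'b
Applies f to each element of the input list together with an accumulator; f returns the next value of the accumulator.
[Figure: fold threads f along the list, starting from the initial accumulator value and returning the final one]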
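The two signatures above can be sketched in Python as follows. This is only an illustration: the names my_map and fold and the sample list are assumptions, not part of the slides.

```python
def my_map(f, lst):
    """Apply f to every element and return a new list; lst is not modified."""
    return [f(x) for x in lst]

def fold(f, x0, lst):
    """Left fold: f takes (element, accumulator) and returns the next accumulator."""
    acc = x0
    for x in lst:
        acc = f(x, acc)
    return acc

# Square every element, then sum the squares starting from accumulator 0.
squares = my_map(lambda x: x * x, [1, 2, 3, 4])
total = fold(lambda x, a: x + a, 0, squares)
print(squares, total)
```

Neither function mutates its input, which is the property the later slides rely on for parallel execution.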
From Functional Language View
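Functional operations never modify data; they always produce new data
map and reduce are inherently parallel
Map can be fully parallelized
Reduce can run out of order and concurrently when f is associative (strictly, reordering elements also requires f to be commutative)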
Reduce foldl : (a -> [a] -> a)
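Why associativity matters can be seen in a small sketch: a list is cut into chunks, each chunk is reduced independently (these reductions could run on different machines), and the partial results are reduced again. The helper chunked_reduce and the chunk size are assumptions made for illustration.

```python
from functools import reduce
import operator

def chunked_reduce(f, identity, lst, chunk_size):
    """Reduce each chunk independently (these could run in parallel),
    then reduce the partial results."""
    chunks = [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]
    partials = [reduce(f, chunk, identity) for chunk in chunks]
    return reduce(f, partials, identity)

data = list(range(1, 11))

# Addition is associative: the chunked result matches the sequential one.
seq_sum = reduce(operator.add, data, 0)
par_sum = chunked_reduce(operator.add, 0, data, 3)

# Subtraction is not associative: the chunked result differs.
seq_sub = reduce(operator.sub, data, 0)
par_sub = chunked_reduce(operator.sub, 0, data, 3)

print(seq_sum, par_sum, seq_sub, par_sub)
```

The chunks are combined in their original order, so this sketch relies only on associativity (and an identity element), not on commutativity.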
Example
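(* each helper is a single left fold over the list *)
fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
fun mul(lst) = foldl (fn (x,a) => x*a) 1 lst
fun length(lst) = foldl (fn (x,a) => 1+a) 0 lst
(* the helpers are defined before foo, which uses them *)
fun foo(l: int list) = sum(l) + mul(l) + length(l)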
MapReduce is…
“MapReduce is a programming model and an associated implementation for processing and generating large data sets.”[1]
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
From Parallel Computing View
MapReduce is a parallel programming model
the essence is a single function that executes in parallel on independent data sets, with outputs that are eventually combined to form a single or small number of results.
f is a map operator: map f (x:xs) = f x : map f xs
g is a reduce operator: reduce g y (x:xs) = reduce g (g y x) xs
homomorphic skeletons
MapReduce Framework
Typical problem solved by MapReduce
Read the input data: records in key/value pair format
Map: extract something from each record
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair and emits intermediate key/value pairs
Shuffle: exchange and regroup the data
Brings intermediate results with the same key together on the same node
Reduce: aggregate, summarize, filter, etc.
reduce (out_key, list(intermediate_value)) -> list(out_value)
Merges all values for one key, performs the computation, and emits the combined result (usually just one)
Write the output
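The pipeline above can be sketched in a single process as follows. This is a toy illustration under stated assumptions, not how a real MapReduce runtime is implemented: run_mapreduce, the sample records, and the max-temperature job are all hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process sketch of the map -> shuffle -> reduce pipeline."""
    # Map: each input (key, value) record yields intermediate (key, value) pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))
    # Shuffle: gather all intermediate values that share a key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)
    # Reduce: merge the values of each key into a list of output values.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Hypothetical job: maximum temperature per city.
readings = [("r1", ("Beijing", 3)), ("r2", ("Shanghai", 9)), ("r3", ("Beijing", -2))]
result = run_mapreduce(
    readings,
    map_fn=lambda _key, value: [(value[0], value[1])],
    reduce_fn=lambda _key, values: [max(values)],
)
print(result)
```

The user supplies only map_fn and reduce_fn; grouping by key is done by the framework part of the sketch.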
Shuffle Implementation
Partition and Sort Group
Partition function: hash(key) % number of reducers
Group function: sort by key
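A minimal sketch of these two functions, assuming string keys. CRC32 stands in for the hash function here because Python's built-in hash of strings varies between processes; the names partition and shuffle are illustrative.

```python
import zlib
from itertools import groupby

def partition(key, num_reducers):
    """hash(key) % number of reducers, using CRC32 as a stable stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route each pair to a reducer bucket, then sort and group each bucket by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    grouped = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # group function: sort by key
        grouped.append([(k, [v for _, v in grp])
                        for k, grp in groupby(bucket, key=lambda kv: kv[0])])
    return grouped

pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
per_reducer = shuffle(pairs, num_reducers=2)
print(per_reducer)
```

Because the partition depends only on the key, all values for one key end up in the same reducer bucket.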
Word Frequencies in Web pages
Input: one document per record
The user implements the map function, whose input is
key = document URL
value = document contents
map emits (potentially many) key/value pairs
For every word occurrence in the document, it emits a record <word, "1">
Example continued:
The MapReduce runtime system (library) collects all records with the same key together (shuffle/sort)
The user implements the reduce function, which computes over the values for one key
Sum
Reduce emits <key, sum>
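The word-frequency job described above might look like this in a single process. The URLs and document contents are made up, and the in-memory grouping merely stands in for the runtime's shuffle/sort.

```python
from collections import defaultdict

def map_fn(url, contents):
    """Emit <word, "1"> for every word occurrence in the document."""
    return [(word, "1") for word in contents.lower().split()]

def reduce_fn(word, counts):
    """Sum the counts collected for one word and emit <key, sum>."""
    return (word, sum(int(c) for c in counts))

def word_count(documents):
    groups = defaultdict(list)  # stands in for the runtime's shuffle/sort
    for url, contents in documents.items():
        for word, one in map_fn(url, contents):
            groups[word].append(one)
    return dict(reduce_fn(w, c) for w, c in sorted(groups.items()))

docs = {  # hypothetical URLs and contents
    "http://example.com/a": "the cat sat",
    "http://example.com/b": "the cat ran",
}
counts = word_count(docs)
print(counts)
```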
Inverted Index
Build Inverted Index
Map: <doc#, word> ➝ [<word, doc-num>]
Reduce: <word, [doc1, doc3, ...]> ➝ <word, "doc1, doc3, …">
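A small sketch of the inverted index job. It assumes the map input is a document id together with the document's text, and the sample documents are invented for illustration.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit <word, doc-num> once for each distinct word in the document."""
    return [(word, doc_id) for word in sorted(set(text.split()))]

def reduce_fn(word, doc_ids):
    """Join the sorted document ids into one posting string for the word."""
    return (word, ", ".join(sorted(doc_ids)))

def build_inverted_index(docs):
    groups = defaultdict(list)  # stands in for shuffle/sort
    for doc_id, text in docs.items():
        for word, d in map_fn(doc_id, text):
            groups[word].append(d)
    return dict(reduce_fn(w, ids) for w, ids in sorted(groups.items()))

docs = {  # hypothetical documents
    "doc1": "cloud map reduce",
    "doc2": "cloud storage",
    "doc3": "map cloud",
}
inverted = build_inverted_index(docs)
print(inverted)
```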
Build index
Input: web page data
Mapper: <url, document content> ➝ <term, docid, locid>
Shuffle & Sort: sort by term
Reducer: <term, docid, locid>* ➝ <term, <docid, locid>*>
Result: global index file, can be split by docid range
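A positional variant can be sketched the same way. The assumptions here are that URLs have already been mapped to integer docids and that locid is the word offset within the document; both are illustrative choices, not taken from the slides.

```python
from collections import defaultdict

def mapper(docid, content):
    """Emit <term, docid, locid>, where locid is the word position in the document."""
    return [(term, docid, locid) for locid, term in enumerate(content.split())]

def build_index(pages):
    postings = defaultdict(list)
    for docid, content in pages.items():
        for term, d, loc in mapper(docid, content):
            postings[term].append((d, loc))
    # Sort by term, and by (docid, locid) within each term's posting list.
    return {term: sorted(postings[term]) for term in sorted(postings)}

pages = {1: "map reduce map", 2: "reduce cloud"}  # hypothetical docids and text
index = build_index(pages)
print(index)
```

Because each posting list is sorted by docid, splitting the index by docid range amounts to slicing each list.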
Quiz
PageRank Algorithm
Clustering Algorithm
Recommendation Algorithm
1. Describe the serial algorithm
1.1 The core formula, the steps, and an explanation
1.2 Input data representation and core data structures
2. Implementation under MapReduce
2.1 How to write map and reduce
2.2 What the input and output of each are
Stories of the Cloud…
A Picture is Worth…
The Information Factories
Googleplex servers number 450,000, according to the lowest estimate
200 petabytes of hard disk storage
Four petabytes of RAM
To handle the current load of 100 million queries a day, input-output bandwidth must be in the neighborhood of 3 petabits per second
The Supercomputer that Connects Everything and Everyone
LARRY PAGE: And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back, in computer science we call that artificial intelligence. That means it would be smart, and we're a long ways from having smart computers.
The Prototype (1995)
Early Google System
Spring 2000 Design
Late 2000 Design
Spring 2001 Design
Empty Google Cluster
Three Days Later…
Age of DataCenters
High-end Mainframe vs. Commodity PC Cluster
Commodity PC cluster: good price/performance, scale out, but poor reliability
High-end mainframe: scale in, high reliability
High Capability System
SC5832
5832 Gigaflops
7776 Gigabytes ECC memory
972 6-core 64-bit nodes
2916 2 GByte/s fabric links
About 1 microsecond MPI latency
108 8-lane PCI-Express
18 KW
1 Cabinet
Millicomputers 2007
Millicomputers 2008
Guesses for 2010??
Packaging Comparisons in 1U
Cloud Computing
“The desktop is dead. Welcome to the Internet cloud, where massive facilities across the globe will store all the data you'll ever use.”
What is Cloud Computing?
1. First write down your own opinion about "cloud computing", whatever comes to mind.
2. Questions: What? Who? Why? How? Pros and cons?
3. The most important question is: what does it have to do with me?
Cloud Computing is…
No software
Access everywhere via the Internet
Power: large-scale data processing
Appeal for startups
Cost efficiency
It is just so convenient
Software as platform
Cons
Security
Data lock-in
SaaS
PaaS
Utility Computing
Software as a Service (SaaS)
a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
Platform as a Service (PaaS)
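For developing web applications and services, PaaS provides a complete, Internet-based integrated environment that covers development, testing, deployment, operation, and maintenance. In particular, it has a multi-tenant architecture from the start: users do not need to think about multi-user concurrency, because the platform handles it, including concurrency management, scalability, failure recovery, and security.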
Utility Computing
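"Pay-as-you-go" is like plugging into a wall socket: the voltage you get is the same as the voltage Microsoft gets, you just use less and pay less. The goal of utility computing is to give computing resources the same kind of service capability, so that a user can use the computing resources that Fortune 500 companies have, only using less and paying less. This is an important aspect of cloud computing.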
Cloud Computing is…
Key Characteristics
illusion of infinite computing resources available on demand;
elimination of an up-front commitment by Cloud users (startup launch costs);
ability to pay for use of computing resources on a short-term basis as needed. Billing in small time slices; the report points out that utility computing failed in practice on this point
very large datacenters
large-scale software infrastructure
operational expertise
Why now?
Experience with very large-scale datacenters, driven by new technology trends and business models
pay-as-you-go computing
Key Players
Amazon Web Services
Google App Engine
Microsoft Windows Azure
Key Applications
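Mobile interactive applications. Tim O'Reilly believes the future belongs to services that can deliver information to users in real time. Mobile is bound to be key. Running the back end in a datacenter is a natural model, especially for mashup-style services.
Parallel batch processing. Using cloud computing technology for large-scale data processing is natural, and MapReduce and Hadoop play an important role here. Moving data into and out of the cloud is a major cost, and Amazon has begun to host large public datasets for free.
The rise of analytics. Among database applications, transaction-based applications are still growing, while analytics applications are growing rapidly, driven strongly by data mining, user behavior analysis, and similar applications.
Extension of compute-intensive desktop applications. For compute-intensive tasks, MATLAB and Mathematica, for example, now have cloud computing extensions. Woo~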
Cloud Computing = Silver Bullet?
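On March 7, Google Docs suffered an incident in which the files of a large number of users were leaked. A US privacy protection group then asked the government to take action against Google to make it strengthen the security of its cloud computing products.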
Problem of Data Lock-in
Challenges
Some other Voices
It’s stupidity. It’s worse than stupidity: it’s a marketing hype campaign. Somebody is saying this is inevitable — and whenever you hear somebody saying that, it’s very likely to be a set of businesses campaigning to make it true.
Richard Stallman, quoted in The Guardian, September 29, 2008
The interesting thing about Cloud Computing is that we’ve redefined Cloud Computing to include everything that we already do. . . . I don’t understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads.
Larry Ellison, quoted in the Wall Street Journal, September 26, 2008
What does it matter to ME?!
What would you do with 1,000 PCs, or even 100,000 PCs?
Cloud is coming…
Cloud Computing Initiative
Google and IBM team up on a cloud computing initiative for universities (2007-2008)
Provide several hundred computers, accessed through the Internet, to test parallel programming projects
The idea for the program came from Google senior software engineer Christophe Bisciglia
Google Code University
M45 : Open Academic Clusters
Collaboration with Major Research Universities
Foster open research
Focus on large-scale, highly parallel computing
Seed Facility: Datacenter in a Box (DiB)
500 nodes, 4000 cores, 3TB RAM, 1.5PB disk
High bandwidth connection to Internet
Located on Yahoo! corporate campus
Runs Yahoo! / Apache Grid Stack
Carnegie Mellon University is Initial Partner
Public Announcement 11/12/07
Summary
MapReduce
Distributed Programming Model
It’s fun!
Infrastructure
Cloud computing
Imagination!
Readings
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
Resources
[Dean and Ghemawat, 2004] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
[Chang et al., 2006] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)," in OSDI, 2006, pp. 205-218.
[Dean, 2006] J. Dean, "Experiences with MapReduce, an abstraction for large-scale computation," in Proceedings of the 15th international conference on Parallel architectures and compilation techniques. Seattle, Washington, USA: ACM Press, 2006.
[Ghemawat et al., 2003] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the nineteenth ACM symposium on Operating systems principles. Bolton Landing, NY, USA: ACM Press, 2003.
http://lucene.apache.org/hadoop/, 2008
Thank You!
Q&A
Calculate PageRank
Input: WebGraph <from, <PR, <to>*>>
Iterate until convergence
Mapper:
<from, <PR, <to>*>> ➝ <to, PR / outDegree(from)>
<from, <PR, <to>*>> ➝ <from, <0, <to>*>>
Shuffle & Sort: by <to>
Reducer:
<to, value>* and <to, <0, <out>*>> ➝ <to, <∑(value), <out>*>>
Result:
<to, ∑(value)> are PR[], the PageRank result array
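The iteration above can be sketched in one process as follows. Like the slide, it uses no damping factor, and it assumes every node has an entry in the graph. The three-node graph, the tolerance, and the iteration cap are illustrative assumptions.

```python
from collections import defaultdict

def mapper(node, pr, out_links):
    """Emit a rank share for each target, plus the node's links with rank 0."""
    for to in out_links:
        yield to, pr / len(out_links)
    yield node, (0.0, out_links)

def pagerank_iteration(graph):
    """graph maps node -> (PR, [out-links]); returns the graph after one pass."""
    shuffled = defaultdict(list)  # shuffle & sort by <to>
    for node, (pr, out_links) in graph.items():
        for key, value in mapper(node, pr, out_links):
            shuffled[key].append(value)
    new_graph = {}
    for node, values in shuffled.items():  # reducer
        rank = sum(v for v in values if not isinstance(v, tuple))
        links = next(v[1] for v in values if isinstance(v, tuple))
        new_graph[node] = (rank, links)
    return new_graph

def pagerank(graph, tol=1e-9, max_iter=100):
    """Repeat the map/reduce pass until no rank changes by more than tol."""
    for _ in range(max_iter):
        new_graph = pagerank_iteration(graph)
        delta = max(abs(new_graph[n][0] - graph[n][0]) for n in graph)
        graph = new_graph
        if delta < tol:
            break
    return {node: pr for node, (pr, _) in graph.items()}

# Hypothetical web graph: A links to B and C, B links to C, C links to A.
web = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["C"]), "C": (1 / 3, ["A"])}
ranks = pagerank(web)
print(ranks)
```

Each iteration corresponds to one MapReduce job; the link structure is passed through with rank 0 so that the reducer can rebuild the graph for the next iteration.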
MapReduce Framework
[Figure: Data store 1 through Data store n feed input key/value pairs to the map tasks. Each map emits (key 1, values...), (key 2, values...), (key 3, values...). A barrier aggregates intermediate values by output key. Each reduce task takes one key with its intermediate values and produces the final values for key 1, key 2, and key 3.]
...