Why Functional Programming Is Important In The Big Data Era?
handaru@tiket.com
What Is Big Data?
What Are The Steps?
[Diagram: Collect → Analyze → Act On]
What We Need?
[Diagram: distributed computing, a cluster to process the data]
What We Need?
• Spark as the data processing engine in the cluster, originally written in Scala, which allows concise function syntax and interactive use
• Mesos as the cluster manager
• ZooKeeper as a highly reliable distributed coordinator
• HDFS as distributed storage
What We Need?
• Pure functions
• Atomic operations
• Parallel patterns or skeletons
• Lightweight algorithms
The only thing that works for parallel programming is functional programming.
--Carnegie Mellon Professor Bob Harper
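To make the point concrete, here is a small sketch (mine, not from the deck): because a pure function has no side effects, the runtime is free to apply it in any order, on any thread, and the result cannot change.

  // A pure function: its result depends only on its input.
  val square = (x: Int) => x * x
  // Scala's parallel collections can therefore apply it concurrently
  // without changing the answer.
  val total = (1 to 1000).par.map(square).sum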
What Is Functional Programming?
FP Quick Tour In Scala
• Basic collections:
  var array = new Array[Int](10)
  var list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• Indexing:
  array(0) = 1
  println(list(0))
• Anonymous functions:
  val multiply = (x: Int, y: Int) => x * y
  val procedure = { x: Int =>
    println("Hello, " + x)
    println(x * 10)
  }
FP Quick Tour In Scala
• Scala closure syntax:
  (x: Int) => x * 10  // full version
  x => x * 10         // type inference
  _ * 10              // underscore syntax
  x => {              // body is a block of code
    val y = 10
    x * y
  }
FP Quick Tour In Scala
• Processing collections:
  var list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
  list.foreach(x => println(x))
  list.map(_ * 10)
  list.filter(x => x % 2 == 0)
  list.reduce((x, y) => x + y)
  list.reduce(_ + _)

  def f(x: Int) = List(x - 1, x, x + 1)
  list.map(x => f(x))
  list.map(f(_))
  list.flatMap(x => f(x))                // flattens the nested lists
  list.map(x => f(x)).reduce(_ ++ _)     // equivalent result via map + concat
Spark Quick Tour
• Spark context:
  • The entry point to Spark functionality
  • In spark-shell, created for you as sc
  • In a standalone Spark program, we must create it ourselves (see the sketch below)
• Resilient distributed datasets (RDDs):
  • A distributed memory abstraction
  • A logically centralized entity, but physically partitioned across multiple machines inside a cluster based on some notion of key
  • Immutable
  • Automatically rebuilt on failure
  • Cached partitions evicted with an LRU (Least Recently Used) algorithm
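A minimal sketch of the standalone case (the app name and master URL are illustrative assumptions, not from the deck):

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCountApp {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("WordCountApp")  // name shown in the cluster UI
        .setMaster("local[*]")       // assumption: local run; use a mesos:// URL on a cluster
      val sc = new SparkContext(conf)
      // ... build RDDs and run actions here ...
      sc.stop()
    }
  }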
Spark Quick Tour
[Diagram: Working with RDDs]
Spark Quick Tour
[Diagram: Cached RDDs]
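A small sketch of what caching buys (the HDFS path is illustrative; assumes sc exists):

  val lines = sc.textFile("hdfs://localhost/test/tobe.txt").cache()
  lines.count()  // first action: reads from HDFS and fills the cache
  lines.count()  // later actions are served from memory, not re-read from disk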
Spark Quick Tour
• Transformations:
  • Lazy operations that build new RDDs from existing RDDs (see the sketch after this list)
  • Narrow transformations (involve no data shuffling): map, flatMap, filter
  • Wide transformations (involve data shuffling): sortByKey, reduceByKey, groupByKey
• Actions:
  • Return a result to the driver or write it to storage
  • collect, count, take(n)
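A minimal sketch of that laziness (assumes an existing sc; the numbers are arbitrary):

  val nums    = sc.parallelize(1 to 100)
  val squares = nums.map(x => x * x)        // transformation: builds lineage, computes nothing
  val evens   = squares.filter(_ % 2 == 0)  // still lazy
  println(evens.count())                    // action: triggers the whole pipeline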
Spark Quick Tour
[Diagram: Transformations]
Spark Quick Tour
• Creating RDDs:
  // Turn a collection into an RDD
  val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
  // Base RDDs from files in HDFS
  val textFile = sc.textFile("hdfs://localhost/test/tobe.txt")
  val textFiles = sc.textFile("hdfs://localhost/test/*.txt")
• Basic transformations (each returns a new, transformed RDD):
  val squares = numbers.map(x => x * x)
  val evens = squares.filter(_ < 9)
  val mapto = numbers.flatMap(x => 1 to x)
  val words = textFile.flatMap(_.split(" ")).cache()
Spark Quick Tour
• Basic actions:
  words.collect()
  words.take(5)
  words.count()
  words.reduce(_ + _)
  words.filter(_ == "be").count()
  words.filter(_ == "or").count()
  words.saveAsTextFile("hdfs://localhost/test/result")
• The influence of the cache: because words was cached, repeated actions such as the two counts above run against in-memory data instead of re-reading HDFS.
Spark Quick Tour
• Pair syntax:
  val pair = (a, b)
• Accessing pair elements:
  pair._1  // first element
  pair._2  // second element
• Key-value operations:
  val pets = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))
  pets.reduceByKey(_ + _)  // ("cat", 4), ("dog", 2)
  pets.groupByKey()        // ("cat", Seq(1, 3)), ("dog", Seq(2))
  pets.sortByKey()         // ("cat", 1), ("cat", 3), ("dog", 2)
Hello World
  val logFile = "hdfs://localhost/test/tobe.txt"
  val logData = sc.textFile(logFile).cache()
  val wordCount = logData.flatMap(_.split(" "))
                         .map((_, 1))
                         .reduceByKey(_ + _)
  wordCount.saveAsTextFile("hdfs://localhost/wordcount/result")
  sc.stop()
Execution
[Diagram: Software Components. The application's Spark Context connects to the Mesos Master, with ZooKeeper providing reliable coordination; the master schedules Spark Executors on the Mesos Slaves, and all nodes read from and write to HDFS or other storage.]
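To tie the components together in code, a configuration sketch (the host and ZooKeeper path are assumptions) pointing the Spark Context at a ZooKeeper-coordinated Mesos master:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("SparkOnMesos")
    .setMaster("mesos://zk://localhost:2181/mesos")  // ZooKeeper-backed Mesos master
  val sc = new SparkContext(conf)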
Literature
• Parallel Programming With Spark
• Spark: Low latency, massively parallel processing framework
handaru@tiket.com