Introduction to Big Data
Haifa Big Data Meetup - Meeting 1
Introduction to Big Data
Organizer + Lecturer – Nathan Krasney
Nathan Krasney 23/6/15
Introduction to Big Data
• Big Data use cases
• What is Big Data:
– Definitions
– Technologies
• Why is the future so bright for Big Data
Use Cases – A
• http://www.ted.com/playlists/56/making_sense_of_too_much_data
• In recent years, users have generated a huge amount of data: blogs, web sites, forums, Facebook, YouTube, LinkedIn, …
• The data is mostly personal: posts, likes, profiles, …
• The data contains the personal preferences, geographic location, … of hundreds of millions of people, at a scale that did not exist a few years ago.
• Machine learning algorithms can process this data to infer very interesting personal characteristics of people
Use Cases – A con’d
Facebook Active Users Per Month [in millions]
Use Cases – A con’d
What kind of information can we produce by processing data on the web?
• Political preferences
• Personal characteristics
• Age
• Gender
• Religion
• Intelligence
• Consumer preferences
Use Cases – A1
Example 1: Facebook likes
Recent research has found the top 5 likes that indicate intelligent people, for example liking this page. But why?
Use Cases – A1 con’d
In general, people tend to choose friends who are like them. For example, young people choose young people as their friends, smart people choose smart people as their friends, and so on.
It turns out that this particular page was liked by a group of intelligent people, and it spread virally on the web via the likes of their friends (who are also highly intelligent).
But this conclusion could be reached only by having big data and being able to process it.
Use Cases – A2
Example 2 - Forbes magazine: a company named Target started sending a particular family suggestions for baby clothing even before the daughter had told her parents she was pregnant. How did Target know about it?
Use Cases – A2 con’d
• It turns out that the company (https://corporate.target.com/) has a huge database of the purchases made in its stores. Furthermore, the company has a smart algorithm that identifies pregnancy from the shopping a woman does at Target
• The algorithm even identifies the pregnancy due date!
• The algorithm identified the girl's pregnancy not necessarily from baby products she bought, but from the vitamins she bought, a bigger handbag (for diapers) and other indirect indicators
• The company's sales reached $71 billion in 2014 and it has existed since 1902, so it has quite a lot of data …
Use Cases – A2 con’d
• The huge amount of data (big data) that Target has gathered about its customers and their purchases has allowed the company to find behavioral patterns that indicate an upcoming pregnancy, based on purchases of items such as vitamins, a bigger bag and so on
Use Cases – A3
Example 3
• Processing the huge amount of personal data that is publicly available on the web (Facebook, LinkedIn, forums, web sites, blogs, YouTube, Instagram, …) to predict a personal profile. This can help, e.g., HR offices and companies hiring people
• Identifying the social group you belong to, using clustering, can further improve this predicted profile
• A better prediction of the user profile is worth more money
What is Big Data?
• 3 V’s:
– Volume
– Velocity
– Variety
What is Big Data ? Con’d
The three V's from another angle
What is Big Data ? Con’d
• Data model - which fields of data will be stored and how: the data type and any restrictions on the data input
• Structured data – based on a data model, e.g. a relational database. Needs a schema
• Unstructured data – no data model, e.g. e-mails, PDF files, web pages, videos, audio, photos. Schema free. Suits NoSQL
• Batch: offline processing, e.g. with Hadoop
• Streaming: online (real-time) processing, e.g. with Spark
• Terabyte – 1,000 GB
• Zettabyte – 1,000,000,000 TB
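The unit definitions above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, as the slide does; the constant names are chosen here for illustration.

```python
# Decimal (SI) storage units, matching the definitions on the slide.
GB = 10**9                 # bytes in a gigabyte
TB = 1_000 * GB            # terabyte = 1,000 GB
PB = 1_000 * TB            # petabyte = 1,000 TB
ZB = 1_000_000_000 * TB    # zettabyte = 1,000,000,000 TB

print(TB // GB)      # 1000
print(ZB // TB)      # 1000000000
print(ZB == 10**21)  # True
```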
What is Big Data ? Con’d
The three V's from one more angle
What is Big Data ? Con’d
Who’s Generating Big Data
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
What is Big Data ? Con’d
Batch use case – Blackberry (good times stat…)
Data:
• Instrumentation data from devices
• 650 TB daily, 100 PB total
Processing is used for business analytics, e.g. viewing graphs
What is Big Data ? Con’d
Batch use case – CBS Interactive (online content network for information and entertainment)
Data:
• 1 PB of content, click streams, web logs
• 1 PB events tracked daily
Processing is used for business analytics, e.g. identifying user patterns such as “high value” users in order to target content
What is Big Data ? Con’d
Streaming use case – Cyber security (fraud detection) by RSA
Machine learning may stop suspicious credit card transactions. E.g. an Israeli who buys a lot online may be blocked when he makes the same online purchase while traveling in China.
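The travel example can be sketched as a streaming check that looks at one transaction at a time. This is only a toy rule under stated assumptions, not RSA's method: the class `LocationProfile`, the `min_share` threshold and the user and country identifiers are all hypothetical.

```python
from collections import Counter, defaultdict

class LocationProfile:
    """Toy per-user profile: counts of countries seen in past transactions."""

    def __init__(self, min_share=0.05):
        # Hypothetical threshold: a country seen in under 5% of past
        # transactions is treated as unusual for that user.
        self.min_share = min_share
        self.history = defaultdict(Counter)

    def is_suspicious(self, user, country):
        seen = self.history[user]
        total = sum(seen.values())
        if total == 0:
            return False  # no history yet, nothing to compare against
        return seen[country] / total < self.min_share

    def record(self, user, country):
        self.history[user][country] += 1

profile = LocationProfile()
for _ in range(50):
    profile.record("user-1", "IL")  # the user normally buys from Israel

print(profile.is_suspicious("user-1", "IL"))  # False
print(profile.is_suspicious("user-1", "CN"))  # True
```

A real system would learn from many more features (amount, merchant, device, time of day) instead of a single hand-written rule.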
What is Big Data ? Con’d
So we have gathered a huge amount of data. Now what?

The problem – processing big data
Traditional large-scale computation used a stronger computer (a supercomputer):
• faster processors
• more memory
What is Big Data ? Con’d
But even this was not enough.
A better solution is a distributed system: use multiple machines for a single job.
But this also has its problems:
• programming complexity - keeping data and processes in sync
• finite bandwidth
• partial failures - e.g. the failure of one computer should not bring the system down
What is Big Data ? Con’d
Modern systems have much more data:
• terabytes (1 terabyte = 1,000 gigabytes) a day
• petabytes (1 petabyte = 1,000 terabytes) in total
The approach of keeping all the data in one central place is not suitable for big data
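A back-of-the-envelope calculation shows why a single central store does not keep up. The sketch below assumes a sequential read rate of 100 MB/s per disk; that figure is an illustrative assumption, not a number from the talk.

```python
PB = 10**15                             # bytes in a petabyte (decimal units)
DISK_READ_BYTES_PER_SEC = 100 * 10**6   # assumption: 100 MB/s per disk

def read_time_seconds(num_bytes, num_disks):
    """Time to scan num_bytes when the data is spread evenly over num_disks."""
    return num_bytes / (DISK_READ_BYTES_PER_SEC * num_disks)

single = read_time_seconds(PB, 1)
parallel = read_time_seconds(PB, 1000)

print(f"1 disk: {single / 86400:.1f} days")          # 1 disk: 115.7 days
print(f"1000 disks: {parallel / 3600:.1f} hours")    # 1000 disks: 2.8 hours
```

Under these assumptions, spreading the data over many disks that are read in parallel turns months into hours.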
What is Big Data ? Con’d
The new approach – Apache Hadoop
A software framework for storing, processing and analyzing big data
• Distributed
• Scalable
• Fault tolerant
• Open source
• Ecosystem
What is Big Data ? Con’d
The new approach – Hadoop
Hadoop core components:
• HDFS (Hadoop Distributed File System) - stores the data on the cluster
• MapReduce - processes the data on the cluster
What is Big Data ? Con’d
HDFS basic concepts
• HDFS is a file system written in Java
• Sits on top of a native file system, e.g. that of Linux
• Stores massive amounts of data:
– scalable
– fault tolerant
– supports efficient processing with MapReduce
What is Big Data ? Con’d
HDFS basic concepts
A cluster may have hundreds or thousands of servers
What is Big Data ? Con’d
HDFS basic concepts
How files are stored
• Data files are split into blocks and distributed to the data nodes (computers)
• Each block is replicated on multiple nodes (3 is the default)
• The NameNode stores metadata
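The storage scheme above can be simulated in a few lines. This is a toy model, not HDFS code: the 128 MB block size is an assumption (the slide does not give one), the round-robin placement is a stand-in for the real placement policy, and the node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # assumption: 128 MB blocks
REPLICATION = 3                  # default replication factor from the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy placement: put each block on `replication` distinct nodes, round-robin."""
    if len(data_nodes) < replication:
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block_id: [data_nodes[(block_id + i) % len(data_nodes)]
                   for i in range(replication)]
        for block_id in range(num_blocks)
    }

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
# The block-to-node mapping plays the role of the NameNode metadata.
namenode_metadata = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

print(len(blocks))             # 3
print(namenode_metadata[0])    # ['dn1', 'dn2', 'dn3']
```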
What is Big Data ? Con’d
Get data in / out of HDFS
What is Big Data ? Con’d
MapReduce
MapReduce has 3 main phases:
Phase 1 - The Mapper
• Each Map task works (typically) on one HDFS block
• Map tasks run (typically) on the same node where the block is stored
Phase 2 - Shuffle & Sort
• sorts and collects all intermediate data from all mappers
• happens after all Map tasks are completed
Phase 3 - The Reducer
• operates on the sorted / shuffled intermediate data (the output of the previous phase)
• produces the final output
What is Big Data ? Con’d
Example: counting words
What is Big Data ? Con’d
Phase 1 - The mapper maps the text
What is Big Data ? Con’d
Phase 2 - Shuffle & Sort
What is Big Data ? Con’d
Phase 3 – Reduce
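The three phases of the word count example can be imitated on a single machine in plain Python. This is a minimal sketch of the idea, not Hadoop code; the function names and the two input lines are made up for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Phase 1: emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle_and_sort(pairs):
    """Phase 2: group all values by key and sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Phase 3: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    intermediate = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle_and_sort(intermediate)]

lines = ["the cat sat on the mat", "the dog sat"]
print(word_count(lines))
# [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 2), ('the', 3)]
```

In Hadoop, the mapper and reducer are the parts the programmer writes; the shuffle and sort step is done by the framework.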
What is Big Data ? Con’d
It is important to understand that:
• Map tasks run in parallel - this reduces computation time
• Map tasks run on the machines that contain the data, so the input data does not have to be moved over the network
• Reduce tasks also run in parallel
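The parallelism can be sketched with a thread pool in which each worker stands in for a cluster node. This is an illustration under assumptions: the blocks are tiny in-memory lists, the hash partitioning is a simplified stand-in for the real shuffle, and the threads only model the structure (in CPython they do not speed up CPU-bound work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_block(block):
    """Map task: count the words in one block (a list of lines)."""
    return Counter(word for line in block for word in line.split())

def partition(counters, num_reducers):
    """Shuffle: route every word to exactly one reducer by hashing the word."""
    buckets = [[] for _ in range(num_reducers)]
    for counter in counters:
        for word, count in counter.items():
            buckets[hash(word) % num_reducers].append((word, count))
    return buckets

def reduce_bucket(bucket):
    """Reduce task: sum the partial counts of the words routed to this reducer."""
    totals = Counter()
    for word, count in bucket:
        totals[word] += count
    return totals

blocks = [["a b a"], ["b c"], ["a c c"]]   # one hypothetical block per node
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_block, blocks))             # map tasks in parallel
    reduced = list(pool.map(reduce_bucket,
                            partition(mapped, num_reducers=2)))  # reducers in parallel

result = Counter()
for part in reduced:
    result.update(part)
print(sorted(result.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```

Because every word is routed to exactly one reducer, the reducers never need to talk to each other.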
What is Big Data ? Con’d
Core Hadoop concepts:
• applications are written in high-level languages
• nodes talk to each other as little as possible
• data is distributed in advance
• data is replicated for increased availability and reliability
• Hadoop is scalable and fault tolerant
What is Big Data ? Con’d
Fault tolerance:
• node failure is inevitable
• what to do in this case:
– the system continues to function
– the master re-assigns tasks to a different node
– data replication - so no data is lost
– a node that recovers rejoins the cluster automatically
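Task re-assignment and replication can be illustrated with a small simulation. It is a toy model with hypothetical node and task names, not the scheduling logic Hadoop actually uses.

```python
def assign_tasks(tasks, nodes, failed_nodes):
    """Toy master: assign tasks round-robin; tasks on failed nodes move to a live node."""
    live = [node for node in nodes if node not in failed_nodes]
    if not live:
        raise RuntimeError("no live nodes left")
    assignment = {}
    for i, task in enumerate(tasks):
        node = nodes[i % len(nodes)]
        if node in failed_nodes:
            node = live[i % len(live)]   # re-assign to a node that is still up
        assignment[task] = node
    return assignment

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return {block for block, replicas in placement.items()
            if any(node not in failed_nodes for node in replicas)}

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = {0: ["dn1", "dn2", "dn3"], 1: ["dn2", "dn3", "dn4"]}  # 3 replicas each

print(assign_tasks(["t0", "t1", "t2", "t3"], nodes, failed_nodes={"dn2"}))
# {'t0': 'dn1', 't1': 'dn3', 't2': 'dn3', 't3': 'dn4'}
print(readable_blocks(placement, failed_nodes={"dn2"}))  # {0, 1}
```

With three replicas per block, losing one node leaves every block readable; data is lost only if all the nodes holding a block fail together.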
What is Big Data ? Con’d
Scalability means
• capacity grows in linear proportion to the number of nodes added
• increased load results in a graceful decline in performance, not in failure
What is Big Data ? Con’d
Hadoop Ecosystem
What is Big Data ? Con’d
Hadoop Ecosystem
• Querying data: Hive, Pig, Impala
• Data store: HBase (a Bigtable-like store on top of HDFS)
• Getting data into HDFS: Flume
• Schedulers (e.g. for Hadoop MapReduce jobs, Pig jobs): Oozie
• Machine learning: Mahout
What is Big Data ? Con’d
Who uses Hadoop
What is Big Data ? Con’d
Spark
The problem: MapReduce may be slow and does only batch processing
Solution – Spark
• Can do both batch and streaming
• Apache Spark processes data in memory, while Hadoop MapReduce persists data back to disk after a map or reduce action. Processing can be up to 100 times faster
What is Big Data ? Con’d
NoSQL (Not only SQL)
The problem: storage and retrieval of unstructured data, typically a huge amount of it.
The solution:
• NoSQL database
• The data structures used by NoSQL databases:
– key-value: the key is the identifier
– graph: nodes + edges to represent relationships
– document: stores data as JSON documents (MongoDB, CouchDB, ..)
– …
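The three data structures listed above can be pictured with plain Python objects. This is only an in-memory illustration; the keys, names and fields are invented, and no real NoSQL database API is used.

```python
import json

# Key-value: the key is the identifier, the value is opaque to the store.
kv_store = {"user:42": "Dana"}

# Graph: nodes plus edges that represent relationships (here, friendships).
graph = {"Dana": {"Omer"}, "Omer": {"Dana", "Noa"}, "Noa": {"Omer"}}

def friends_of_friends(person):
    """People reachable in exactly two hops who are not already direct friends."""
    direct = graph[person]
    two_hops = set().union(*(graph[friend] for friend in direct))
    return two_hops - direct - {person}

# Document: a schema-free, nested record stored as JSON.
document = {"_id": 42, "name": "Dana", "likes": ["hiking", "jazz"],
            "address": {"city": "Haifa"}}
stored = json.dumps(document)      # what would be written to the store
loaded = json.loads(stored)        # what a query would read back

print(kv_store["user:42"])         # Dana
print(friends_of_friends("Dana"))  # {'Noa'}
print(loaded["address"]["city"])   # Haifa
```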
Why is the future so bright for Big Data
• IoT (Internet of Things) will add a huge amount of data in the coming years
• The cloud allows us to easily store a lot of data
• As time goes by, more data is stored on the net, in companies, institutions, …
• Data processing abilities improve as time goes by (Hadoop, Spark)
• The ability to store huge amounts of data improves as time goes by
• The ability to store more data + better processing leads to smarter information that can be retrieved from the data
• Smart information is power = money