Introduction to Big Data Frameworks
03/05/2023
Machine Learning in Big Data (MapReduce, KNIME, Spark)
Presented by: Sahmoudi Yahia, Targhi Amal
Proposed by: Bouchra Frikh
Outline
• Introduction
• Big Data
• Machine Learning
• Applications of ML Techniques to Data Mining Tasks
• Why Machine Learning in Big Data?
• Big Data Processing Frameworks
Introduction
• Every day, 2.5 quintillion bytes of data are created, and 90% of the data in the world today was produced within the past two years (IBM 2012).
• On October 4, 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within two hours (Twitter Blog 2012).
• Another example is Flickr, a public picture-sharing site, which received an average of 1.8 million photos per day from February to March 2012 (Michel F. 2012).
• "The most fundamental challenge for the Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions" (Rajaraman and Ullman, 2011).
• What is big data?
• What is machine learning?
• Why do machine learning in big data?
• What are the big data processing frameworks that integrate machine learning algorithms?
Big Data
• Big data is a term that describes large volumes of data (structured, semi-structured, unstructured) which can be exploited to extract information.
• Big data can be characterized by 3 Vs:
• the extreme volume of data
• the wide variety of data types
• the velocity at which the data must be processed
P.S.: we can also add two more Vs: Variability and Complexity.
Big Data Analytics
Big data analytics is the process of examining large data sets containing a variety of data types.
"It's important to remember that the primary value from big data comes not from the data in its raw form, but from the processing and analysis of it and the insights, products, and services that emerge from analysis. The sweeping changes in big data technologies and management approaches need to be accompanied by similarly dramatic shifts in how data supports decisions and product/service innovation." (Thomas H. Davenport, Big Data in Big Companies)
Machine Learning
The design and implementation of methods that allow a machine to evolve through a systematic process and to complete tasks that are difficult for a classical algorithm.
In short, Machine Learning is when the data replaces the algorithm.
Example
Facebook's News Feed changes according to the user's personal interactions with other users. If a user frequently tags a friend in photos, writes on his wall, or likes his links, the News Feed will show more of that friend's activity, due to presumed closeness.
Applications of ML Techniques to Data Mining Tasks
• The amount of data seems to increase rapidly every single day in most domains related to information processing, and the need to find a way to mine and extract knowledge from databases remains crucial.
• Data Mining (DM) refers to the automated extraction of hidden predictive information from databases. Among the tasks of DM are:
• Diagnosis
• Pattern Recognition
• Prediction
• Classification
• Clustering
• Optimization
• Control
The impact of ML techniques used for DM tasks
• Artificial Neural Networks for DM
• Genetic Algorithms for DM
• Inductive Logic Programming for DM
• Rule Induction for DM
• Decision Trees for DM
• Instance-based Learning Algorithms for DM
Example
• Amazon is an interesting example of the application of Machine Learning in e-commerce. Suppose we search for a product on Amazon today. When we come back to the site another day, it is able to offer us products related to our specific needs (prediction), thanks to Machine Learning algorithms that infer our changing needs from our previous visits to the site.
Why Machine Learning in Big Data?
• It delivers on the promise of extracting value from big and disparate data sources with far less reliance on human direction. It is data-driven and runs at machine scale. It is well suited to the complexity of dealing with disparate data sources and the huge variety of variables and amounts of data involved. And unlike traditional analysis, machine learning thrives on growing datasets: the more data fed into a machine learning system, the more it can learn, and the higher the quality of the resulting insights.
Big Data Processing Frameworks
• Machine Learning is far from being a recent phenomenon. What is new, however, is the number of parallelized data processing platforms for managing Big Data.
• In this presentation we will take a look at some of the most widely used frameworks: MapReduce, KNIME, and Spark.
MapReduce
• MapReduce, invented by Google, is the heart of Hadoop. It is a processing technique and a programming model for distributed computing based on Java.
• MapReduce allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster, and makes it possible to write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware, in a reliable manner.
• The MapReduce algorithm involves two important tasks, Map and Reduce:
• Map: takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
• Reduce: takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
Algorithm
• Map stage: the map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: this stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
Example
• The Word Count program is used in this section to explain how a mapper works, and will also be used to explain the operation of a reducer.
• The object of this program is to count the number of occurrences of the different words that make up a book.
Book content (figure not reproduced)
Step 1: Map
• Line 1: the mapper reads as input a record that comes in the form of a <key, value> pair, with:
• String as the value type (a line of the file);
• LongWritable as the key type.
• Line 2: for each word in the current line,
• Line 3: we write to an output file the pair <w, 1>, corresponding to one occurrence of the word held in the variable w.
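Since the slide's original code listing is not reproduced here, the line-by-line description above can be sketched in plain Java. This is a simulation, not actual Hadoop API code: the class and method names are hypothetical, and pairs are collected in a list instead of being written to a Context.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the word-count mapper: for each word w in the
// input line, emit the pair <w, 1>. In real Hadoop code this would be a
// Mapper<LongWritable, Text, Text, IntWritable> writing to a Context.
public class WordCountMapper {
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (String w : line.split("\\s+")) {        // for each word in the line
            if (!w.isEmpty()) {
                output.add(new SimpleEntry<>(w, 1)); // emit <w, 1>
            }
        }
        return output;
    }
}
```

Note that the mapper does no counting itself: every occurrence is emitted separately as <w, 1>, and the summing is left entirely to the reducer.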
The output file of the mapper (figure not reproduced)
Step 2 (between Map and Reduce): Shuffle and Sort
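In this intermediate phase, the framework groups all <word, 1> pairs emitted by the mappers by key and sorts the keys, so that each reducer receives <word, [1, 1, ...]>. A plain-Java sketch of that grouping (the names are hypothetical; in reality Hadoop performs the shuffle itself):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of shuffle-and-sort: group the mappers' <word, 1> pairs by key.
// A TreeMap keeps the keys in sorted order, as the framework does.
public class Shuffle {
    public static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            // append this occurrence's value to the word's list of values
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }
}
```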
Step 3: Reduce
Line 1: the reducer reads as input a record in the form of a <key, value> pair, with:
Text as the key type (one word);
the value, a list of values of IntWritable type.
Line 2: the reducer resets the wordCount counter whenever the word changes.
Line 3: each value v in the list is added to wordCount (in this case v is always 1).
Line 4: when the word changes, we write to an output file the pair <inKey2, wordCount>, where wordCount is the number of occurrences of the word.
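The reducer logic described above can likewise be sketched in plain Java (again a simulation with hypothetical names, not the real Hadoop Reducer API):

```java
import java.util.List;

// Sketch of the word-count reducer: for one key (a word) and its list of
// values (one 1 per occurrence), sum the values to get the word's count.
// In real Hadoop code this would be a Reducer<Text, IntWritable, Text, IntWritable>.
public class WordCountReducer {
    public static int reduce(String word, List<Integer> values) {
        int wordCount = 0;
        for (int v : values) {   // each v is 1, one per occurrence of the word
            wordCount += v;
        }
        return wordCount;        // emitted as the pair <word, wordCount>
    }
}
```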
Use case in real life
• http://hadoopilluminated.com/hadoop_book/Hadoop_Use_Cases.html#d1575e1290
Spark
• Apache Spark is an open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley.
• Spark extends the popular MapReduce model to efficiently support more types of computations.
A Unified Stack
• One of the largest advantages of tight integration is the ability to build
applications that seamlessly combine different processing models. For
example, in Spark you can write one application that uses machine
learning to classify data in real time as it is ingested from streaming
sources. Simultaneously, analysts can query the resulting data, also in real
time, via SQL (e.g., to join the data with unstructured logfiles).
• In addition, more sophisticated data engineers and data scientists can
access the same data via the Python shell for ad hoc analysis. Others
might access the data in standalone batch applications. All the while, the
IT team has to maintain only one system.
RDD
• Spark's main abstraction is the resilient distributed dataset (RDD). An RDD is simply a distributed collection of elements. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
• RDDs are the core concept in Spark.
• You can see an RDD as a table in a database.
• Transformations: transformations do not return a single value; they return a new RDD. Nothing is evaluated when a transformation function is applied: the function just takes an RDD and returns a new RDD. Transformations include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, ...
• Actions: actions evaluate the pipeline and return a value. When an action function is called on an RDD object, all pending data processing is computed and the result is returned. Actions include reduce, collect, count, first, take, countByKey, and foreach.
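Java's Stream API follows the same deferred-evaluation model, so it can serve as a minimal stand-in for the transformation/action distinction described above. This is an analogy, not Spark code: the class name is hypothetical and no Spark installation is assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Illustration of lazy transformations vs. eager actions, using Java
// streams as a stand-in for RDDs.
public class LazyDemo {
    // Returns true iff no element was evaluated before the terminal
    // operation, and all elements were evaluated after it.
    public static boolean demonstrateLaziness() {
        List<String> data = Arrays.asList("spark", "mapreduce", "knime");
        AtomicInteger evaluated = new AtomicInteger(0);

        // "Transformation": builds a new pipeline, evaluates nothing yet.
        Stream<Integer> lengths = data.stream()
                .peek(s -> evaluated.incrementAndGet()) // counts real evaluations
                .map(String::length);
        boolean lazyBefore = (evaluated.get() == 0);

        // "Action": the terminal operation triggers the whole computation.
        int total = lengths.mapToInt(Integer::intValue).sum(); // 5 + 9 + 5 = 19
        return lazyBefore && evaluated.get() == 3 && total == 19;
    }
}
```

Deferring evaluation this way is what lets Spark see the whole pipeline before executing it, and schedule the work across the cluster accordingly.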
Spark Ecosystem
Spark Core
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.
Example in real life
Another early Spark adopter is Conviva, one of the largest streaming video companies on the Internet, with about 4 billion video feeds per month (second only to YouTube). As you can imagine, such an operation requires pretty sophisticated behind-the-scenes technology to ensure a high quality of service. As it turns out, it’s using Spark to help deliver that QoS by avoiding dreaded screen buffering.
MapReduce vs. Spark

| | Spark | MapReduce |
| --- | --- | --- |
| Performance | Processes data in-memory | Processes data on disk |
| Compatibility | Can run standalone, on top of Hadoop YARN, or in the cloud | Runs with Hadoop |
| Data Processing | Batch and real-time | Batch |
| Failure Tolerance | + | + |
| Security | Still in its infancy | Hadoop MapReduce has more security features and projects |
Conclusion