How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanović


Page 1: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Darko Marjanović, CEO @ Things Solver, [email protected]

How to use Big Data and Data Lake concept in business using Hadoop and Spark

Page 2: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

About me
• CEO and Co-Founder @ Things Solver
• Co-Founder @ Data Science Serbia
• Big Data, Machine Learning
• Hadoop, Spark, Python

Page 3: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Agenda
• Big Data
• Data Lake
• Data Lake vs Data Warehouse
• Hadoop, Spark, Hive
• Big Data application and Lambda architecture
• Examples
• Data Science Lab

Page 4: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data
• Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
• Anything that won't fit in Excel :)

Page 5: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data
• Volume: the quantity of generated and stored data.
• Variety: the type and nature of the data.
• Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
• Veracity: the quality of captured data, which can vary greatly and affects the accuracy of analysis.

Page 6: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data

• Email, HTML, Click Stream...
• Facebook, Twitter...
• Video, Pictures...
• Logs...
• Sensor Data...
• Relational Databases...

Page 7: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data

Page 8: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
- James Dixon, Pentaho Chief Technology Officer

Page 9: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake
• Retain All Data
• Support All Data Types
• Support All Users
• Adapt Easily to Changes
• Provide Faster Insights

Page 10: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake

Page 11: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake Cons
• Data storage alone has no impact on the effectiveness of business decisions.
• Inexpensive storage is still not limitless.

Page 12: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Warehouse
Wikipedia defines Data Warehouses as:
“…central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.”

Page 13: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Warehouse Problems:
• New Data Sources, Data Types
• Real-Time Reports
• Streaming Data
• Software Price
• Infrastructure Price

Page 14: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake vs Data Warehouse

Page 15: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake vs Data Warehouse
• ETL
• ETL and BI projects are by nature investments into evolving processes: they have no distinct end point and remain ongoing, improving and re-targeting project processes.
• ETL works from the output backwards, hence only relevant data is extracted and processed.
• Future ETL requirements needing data cannot be foreseen and defined in the original design.
• ELT
• Isolating Loading and Transforming enables projects to be broken down into specific chunks that are more isolated and become more manageable.
• ELT is an emergent approach to data warehouse design and development, requiring a change in mentality and design approach compared to traditional ETL.
• Future requirements can easily be incorporated into the warehouse structure, as all data is pulled into the Data Lake in its raw format (a minimal load-then-transform sketch follows below).
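To make the ELT idea concrete, here is a minimal Spark sketch of the load-then-transform flow: the raw export is landed in the lake untouched, and a report-specific transformation is applied later, only when it is needed. The paths, file layout and column positions are assumptions made up for the example, not taken from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EltSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("elt-sketch"))

    // Extract + Load: land the raw export in the Data Lake in its original format.
    // (Source and lake paths are placeholders.)
    sc.textFile("/incoming/crm_export.csv")
      .saveAsTextFile("hdfs:///lake/raw/crm/2017-01-13")

    // Transform: done later, per use case, against the raw copy.
    // Assumed layout: column 1 = country, column 2 = amount.
    val revenuePerCountry = sc.textFile("hdfs:///lake/raw/crm/2017-01-13")
      .map(_.split(","))
      .filter(_.length >= 3)
      .map(cols => (cols(1), cols(2).toDouble))
      .reduceByKey(_ + _)

    revenuePerCountry.saveAsTextFile("hdfs:///lake/curated/revenue_per_country")
    sc.stop()
  }
}
```

Because the raw copy stays in the lake unchanged, a future requirement only means writing another transformation, not going back to the source system.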

Page 16: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hadoop
• The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.

Page 17: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hadoop
• Pros
• Linear scalability.
• Commodity hardware.
• Pricing and licensing.
• All data types.
• Analytical queries.
• Integration with traditional systems.
• Cons
• Implementation.
• MapReduce ease of use.
• Intense calculations with little data.
• In-memory processing.
• Real-time analytics.

Page 18: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Page 19: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• Pros
• Up to 100x faster than MapReduce (in memory).
• Ease of use.
• Streaming, MLlib, GraphX and SQL.
• Pricing and licensing.
• In memory.
• Integration with Hadoop.
• Machine learning.
• Cons
• Integration with traditional systems.
• Limited memory per machine (GC).
• Configuration.

Page 20: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark

Page 21: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• Resilient Distributed Datasets (RDDs) are the basic units of abstraction in Spark.
• An RDD is an immutable, partitioned set of objects.
• RDDs are lazily evaluated.
• RDDs are fully fault-tolerant. Lost data can be recovered using the lineage graph of RDDs (by rerunning operations on the input data).
• RDD operations (see the sketch below):
• Transformations - lazily evaluated (executed only once an action is called, which improves pipelining): map, filter, groupByKey, join, ...
• Actions - run immediately (to return a value to the application or storage): count, collect, reduce, save, ...
• Don't forget to cache()!
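A minimal sketch of the transformation/action split described above, assuming a plain text log file (the file name and its space-separated format are made up for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val lines = sc.textFile("access.log")          // transformation source: nothing is read yet

    // Transformations only build the lineage graph; nothing runs here.
    val errorsPerHost = lines
      .filter(_.contains("ERROR"))
      .map(line => (line.split(" ")(0), 1))
      .reduceByKey(_ + _)
      .cache()                                     // keep the result in memory, it is reused below

    // Actions trigger the actual jobs.
    println(errorsPerHost.count())                 // first job: computes and caches the RDD
    errorsPerHost.collect().foreach(println)       // second job: served from the cache

    sc.stop()
  }
}
```

Without the cache() call, the second action would rerun the whole lineage from the input file.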

Page 22: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• DataFrames are a common abstraction that goes across languages; they represent a table, or two-dimensional array, with columns and rows.
• Spark DataFrames are distributed dataframes. They allow querying structured data using SQL or a DSL (for example in Python or Scala).
• Like RDDs, DataFrames are also immutable structures.
• They are executed in parallel.
• val df = sqlContext.read.json("pathToMyFile.json")
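Building on the read.json call from the slide, a short sketch of the DSL and SQL styles side by side, assuming the Spark 1.x sqlContext API and a JSON file with name and age fields (both the file contents and the column names are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("df-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Same call as on the slide: the schema is inferred from the raw JSON.
    val df = sqlContext.read.json("pathToMyFile.json")
    df.printSchema()

    // DSL style (assumed columns: name, age).
    df.filter(df("age") > 21).select("name", "age").show()

    // SQL style over the same DataFrame.
    df.registerTempTable("people")
    sqlContext.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name").show()

    sc.stop()
  }
}
```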

Page 23: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hive
• Apache Hive is a data warehouse infrastructure for querying, analyzing and managing large datasets residing in distributed storage.

Page 24: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hive
• Pros
• Writing ad hoc queries on large volumes of data (see the sketch below).
• Imposing a structure on a variety of data formats.
• Interactive SQL queries over large datasets residing in Hadoop.
• SQL-like data access.
• Accessing Hadoop data from a traditional DWH environment.
• Cons
• Code efficiency can be lower than in traditional MapReduce.
• Apache Hive has terrible performance for OLTP tasks.
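As an illustration of the "ad hoc SQL over data already in Hadoop" use case, a sketch that runs HiveQL through Spark's HiveContext; the clicks table, its columns and the HDFS location are invented for the example, not part of the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-sketch"))
    val hiveContext = new HiveContext(sc)

    // Impose a table structure on raw CSV files already sitting in the lake.
    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS clicks (user_id STRING, url STRING, ts BIGINT)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |LOCATION 'hdfs:///lake/raw/clicks'""".stripMargin)

    // Ad hoc, SQL-like query over the large dataset residing in Hadoop.
    hiveContext
      .sql("SELECT url, COUNT(*) AS visits FROM clicks GROUP BY url ORDER BY visits DESC LIMIT 10")
      .show()

    sc.stop()
  }
}
```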

Page 25: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Ecosystem
• Collecting data: Kafka, Flume... (a Kafka-to-Spark ingestion sketch follows below)
• Managing data: Pig, Spark, Hive, Flink, MapReduce
• Resource managers: YARN, Mesos
• Administration: Ambari, Bigtop
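To show how the "collecting" and "managing" layers meet, here is a sketch that reads a Kafka topic with Spark Streaming using the Spark 1.x direct stream API (it needs the spark-streaming-kafka artifact on the classpath); the broker address, topic name and lake path are placeholders.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaIngestSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-ingest-sketch"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker
    val topics = Set("events")                                        // placeholder topic

    // Direct stream: Spark reads the topic partitions itself, no separate receiver needed.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Land every micro-batch in the lake as raw text, one directory per batch.
    stream.map(_._2).saveAsTextFiles("hdfs:///lake/raw/events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```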

Page 26: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data Application

Page 27: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Lambda Architecture
• Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data, based on his experience working on distributed data processing systems at Twitter.

Page 28: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Lambda Architecture
• Data
• Batch Layer
• Serving Layer
• Speed Layer (a minimal Spark sketch of the batch and speed layers follows below)
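A minimal sketch of how the layers could map onto Spark, with the batch layer recomputing a complete view over everything in the lake and the speed layer covering only the most recent events; the paths, the comma-separated event format and the socket source are all assumptions made to keep the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LambdaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lambda-sketch").setMaster("local[2]"))

    // Batch layer: periodically recompute a complete view from the raw data in the lake.
    val batchView = sc.textFile("hdfs:///lake/raw/events/*")
      .map(line => (line.split(",")(0), 1L))                 // assumed format: "key,..."
      .reduceByKey(_ + _)
    batchView.saveAsTextFile("hdfs:///serving/batch_view")   // handed to the serving layer

    // Speed layer: cover the events that arrived since the last batch run.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)                  // stand-in for a real stream source
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)
      .print()                                               // real-time view; queries merge both views

    ssc.start()
    ssc.awaitTermination()
  }
}
```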

Page 29: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Social Media Analysis

Page 30: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

IoT Big Data Application

Page 31: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Planning and Optimizing Data Lake Architecture
• Tomorrow, 12h, Big Data Track
• Data Lake Architecture in Practice
• Optimizing Hive and Spark for Data Lakes

Page 32: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Science Lab

datascience.rs

Page 33: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Darko Marjanović, CEO @ Things Solver, [email protected]

How to use Big Data and Data Lake concept in business using Hadoop and Spark