How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanović


Page 1: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Darko Marjanović, CEO @ Things Solver, [email protected]

How to use Big Data and Data Lake concept in business using Hadoop and Spark

Page 2: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

About me
• CEO and Co-Founder @ Things Solver
• Co-Founder @ Data Science Serbia
• Big Data, Machine Learning
• Hadoop, Spark, Python

Page 3: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Agenda
• Big Data
• Data Lake
• Data Lake vs Data Warehouse
• Hadoop, Spark, Hive
• Big Data application and Lambda architecture
• Examples
• Data Science Lab

Page 4: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data
• Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
• Anything that won't fit in Excel :)

Page 5: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data
• Volume: the quantity of generated and stored data.
• Variety: the type and nature of the data.
• Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
• Veracity: the quality of captured data, which can vary greatly and affects the accuracy of analysis.

Page 6: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data

• Email, HTML, Click Stream...
• Facebook, Twitter...
• Video, Pictures...
• Logs...
• Sensor Data...
• Relational Databases...

Page 7: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data

Page 8: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
- James Dixon, Pentaho Chief Technology Officer

Page 9: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake
• Retain All Data
• Support All Data Types
• Support All Users
• Adapt Easily to Changes
• Provide Faster Insights

Page 10: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake

Page 11: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake Cons
• Data storage alone has no impact on the effectiveness of business decisions.
• Inexpensive storage is still not limitless.

Page 12: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Warehouse
Wikipedia defines Data Warehouses as:
“…central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.”

Page 13: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Warehouse Problems:
• New Data Sources, Data Types
• Real-Time Reports
• Streaming Data
• Software Price
• Infrastructure Price

Page 14: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake vs Data Warehouse

Page 15: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Lake vs Data Warehouse
• ETL
• ETL and BI projects are by nature investments into evolving processes: they have no distinct end point and remain ongoing, improving and re-targeting project processes.
• ETL works from the output backwards, hence only relevant data is extracted and processed.
• Future ETL requirements needing data cannot be foreseen and defined in the original design.
• ELT
• Isolating Loading and Transforming enables projects to be broken down into specific chunks that are more isolated and become more manageable.
• ELT is an emergent approach to data warehouse design and development, requiring a change in mentality and design approach compared to traditional ETL.
• Future requirements can easily be incorporated into the warehouse structure, as all data is pulled into the Data Lake in its raw format (a minimal load-then-transform sketch follows below).
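To make the ELT idea concrete, here is a minimal Spark sketch of the load-then-transform flow: the raw export is landed in the lake untouched, and a report-specific transformation is applied later, only when it is needed. The paths, file layout and column positions are assumptions made up for the example, not taken from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EltSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("elt-sketch"))

    // Extract + Load: land the raw export in the Data Lake in its original format.
    // (Source and lake paths are placeholders.)
    sc.textFile("/incoming/crm_export.csv")
      .saveAsTextFile("hdfs:///lake/raw/crm/2017-01-13")

    // Transform: done later, per use case, against the raw copy.
    // Assumed layout: column 1 = country, column 2 = amount.
    val revenuePerCountry = sc.textFile("hdfs:///lake/raw/crm/2017-01-13")
      .map(_.split(","))
      .filter(_.length >= 3)
      .map(cols => (cols(1), cols(2).toDouble))
      .reduceByKey(_ + _)

    revenuePerCountry.saveAsTextFile("hdfs:///lake/curated/revenue_per_country")
    sc.stop()
  }
}
```

Because the raw copy stays in the lake unchanged, a future requirement only means writing another transformation, not going back to the source system.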

Page 16: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hadoop
• The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.

Page 17: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hadoop
• Pros
• Linear scalability.
• Commodity hardware.
• Pricing and licensing.
• All data types.
• Analytical queries.
• Integration with traditional systems.
• Cons
• Implementation.
• MapReduce ease of use.
• Intense calculations with little data.
• In-memory processing.
• Real-time analytics.

Page 18: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Page 19: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• Pros
• Up to 100x faster than MapReduce (in memory).
• Ease of use.
• Streaming, MLlib, GraphX and SQL.
• Pricing and licensing.
• In memory.
• Integration with Hadoop.
• Machine learning.
• Cons
• Integration with traditional systems.
• Limited memory per machine (GC).
• Configuration.

Page 20: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark

Page 21: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• Resilient Distributed Datasets (RDDs) are the basic units of abstraction in Spark.
• An RDD is an immutable, partitioned set of objects.
• RDDs are lazily evaluated.
• RDDs are fully fault-tolerant. Lost data can be recovered using the lineage graph of RDDs (by rerunning operations on the input data).
• RDD operations (see the sketch below):
• Transformations - lazily evaluated (executed only once an action is called, which improves pipelining): map, filter, groupByKey, join, ...
• Actions - run immediately (to return a value to the application or storage): count, collect, reduce, save, ...
• Don't forget to cache()!
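A minimal sketch of the transformation/action split described above, assuming a plain text log file (the file name and its space-separated format are made up for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val lines = sc.textFile("access.log")          // transformation source: nothing is read yet

    // Transformations only build the lineage graph; nothing runs here.
    val errorsPerHost = lines
      .filter(_.contains("ERROR"))
      .map(line => (line.split(" ")(0), 1))
      .reduceByKey(_ + _)
      .cache()                                     // keep the result in memory, it is reused below

    // Actions trigger the actual jobs.
    println(errorsPerHost.count())                 // first job: computes and caches the RDD
    errorsPerHost.collect().foreach(println)       // second job: served from the cache

    sc.stop()
  }
}
```

Without the cache() call, the second action would rerun the whole lineage from the input file.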

Page 22: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Apache Spark
• DataFrames are a common abstraction that goes across languages; they represent a table, or two-dimensional array, with columns and rows.
• Spark DataFrames are distributed dataframes. They allow querying structured data using SQL or a DSL (for example in Python or Scala).
• Like RDDs, DataFrames are also immutable structures.
• They are executed in parallel.
• val df = sqlContext.read.json("pathToMyFile.json")
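Building on the read.json call from the slide, a short sketch of the DSL and SQL styles side by side, assuming the Spark 1.x sqlContext API and a JSON file with name and age fields (both the file contents and the column names are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("df-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Same call as on the slide: the schema is inferred from the raw JSON.
    val df = sqlContext.read.json("pathToMyFile.json")
    df.printSchema()

    // DSL style (assumed columns: name, age).
    df.filter(df("age") > 21).select("name", "age").show()

    // SQL style over the same DataFrame.
    df.registerTempTable("people")
    sqlContext.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name").show()

    sc.stop()
  }
}
```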

Page 23: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hive
• Apache Hive is a data warehouse infrastructure for querying, analyzing and managing large datasets residing in distributed storage.

Page 24: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Hive
• Pros
• Writing ad hoc queries on large volumes of data (see the sketch below).
• Imposing a structure on a variety of data formats.
• Interactive SQL queries over large datasets residing in Hadoop.
• SQL-like data access.
• Accessing Hadoop data from a traditional DWH environment.
• Cons
• Code efficiency can be lower than in traditional MapReduce.
• Apache Hive has terrible performance for OLTP tasks.
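As an illustration of the "ad hoc SQL over data already in Hadoop" use case, a sketch that runs HiveQL through Spark's HiveContext; the clicks table, its columns and the HDFS location are invented for the example, not part of the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-sketch"))
    val hiveContext = new HiveContext(sc)

    // Impose a table structure on raw CSV files already sitting in the lake.
    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS clicks (user_id STRING, url STRING, ts BIGINT)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |LOCATION 'hdfs:///lake/raw/clicks'""".stripMargin)

    // Ad hoc, SQL-like query over the large dataset residing in Hadoop.
    hiveContext
      .sql("SELECT url, COUNT(*) AS visits FROM clicks GROUP BY url ORDER BY visits DESC LIMIT 10")
      .show()

    sc.stop()
  }
}
```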

Page 25: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Ecosystem
• Collecting data: Kafka, Flume... (a Kafka-to-Spark ingestion sketch follows below)
• Managing data: Pig, Spark, Hive, Flink, MapReduce
• Resource managers: YARN, Mesos
• Administration: Ambari, Bigtop
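To show how the "collecting" and "managing" layers meet, here is a sketch that reads a Kafka topic with Spark Streaming using the Spark 1.x direct stream API (it needs the spark-streaming-kafka artifact on the classpath); the broker address, topic name and lake path are placeholders.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaIngestSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-ingest-sketch"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker
    val topics = Set("events")                                        // placeholder topic

    // Direct stream: Spark reads the topic partitions itself, no separate receiver needed.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Land every micro-batch in the lake as raw text, one directory per batch.
    stream.map(_._2).saveAsTextFiles("hdfs:///lake/raw/events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```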

Page 26: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Big Data Application

Page 27: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Lambda Architecture
• Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data, based on his experience working on distributed data processing systems at Twitter.

Page 28: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Lambda Architecture
• Data
• Batch Layer
• Serving Layer
• Speed Layer (a minimal Spark sketch of the batch and speed layers follows below)
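A minimal sketch of how the layers could map onto Spark, with the batch layer recomputing a complete view over everything in the lake and the speed layer covering only the most recent events; the paths, the comma-separated event format and the socket source are all assumptions made to keep the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LambdaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lambda-sketch").setMaster("local[2]"))

    // Batch layer: periodically recompute a complete view from the raw data in the lake.
    val batchView = sc.textFile("hdfs:///lake/raw/events/*")
      .map(line => (line.split(",")(0), 1L))                 // assumed format: "key,..."
      .reduceByKey(_ + _)
    batchView.saveAsTextFile("hdfs:///serving/batch_view")   // handed to the serving layer

    // Speed layer: cover the events that arrived since the last batch run.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)                  // stand-in for a real stream source
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)
      .print()                                               // real-time view; queries merge both views

    ssc.start()
    ssc.awaitTermination()
  }
}
```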

Page 29: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Social Media Analysis

Page 30: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

IoT Big Data Application

Page 31: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Planning and Optimizing Data Lake Architecture
• Tomorrow, 12h, Big Data Track
• Data Lake Architecture in Practice
• Optimizing Hive and Spark for Data Lakes

Page 32: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Data Science Lab

datascience.rs

Page 33: How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Darko Marjanović, CEO @ Things Solver, [email protected]

How to use Big Data and Data Lake concept in business using Hadoop and Spark