big data analytics: technology's bleeding edge

Upload: bhavya-gulati

Post on 15-Jan-2015

DESCRIPTION

There can be data without information, but there cannot be information without data. Companies without big data analytics are deaf and dumb, mere wanderers on the web.

TRANSCRIPT

Big Data refers to massive, often unstructured data that is beyond the processing capabilities of traditional data management tools.

Big Data can take up terabytes or petabytes of storage space in diverse formats, including text, video, sound, and images.

Traditional relational database management systems cannot deal with such large masses of data.

Examples: user updates on Facebook; clicks across the internet.

Volume refers to the huge amount of data being generated every minute.

90% of the data we have now was created in just the past 2 years.

By 2015, IP traffic is expected to be 4X what it is now.

3 billion people are expected to be online by 2015.

Velocity refers to the SPEED at which new data is generated and moves around.

It includes real-time systems such as online banking.

Such systems need low response times.

In-memory analytics technology is employed to deal with data in motion.

Variety refers to the various data types we can now use.

Earlier, the focus was on neat, structured data kept in tables in an RDBMS.

80% of the data available now is unstructured.

Data types are heterogeneous, ranging from text to video, audio, and pictures.

Transform problems into possibilities

Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other real-time insights.

Uses of big data analytics: Google Search recommendations, Satyamev Jayate, gene reading.

Data Mining vs. Big Data Analytics

Data mining: data constraints apply, i.e. data must be neat and clean. Big data analytics: big data cannot be neat, as it is unstructured.

Data mining: elaborate ETL is required, so insights have to wait for the ETL cycle to complete. Big data analytics: provides real-time insights.

Descriptive

Diagnostic

Predictive

Prescriptive

Relational databases failed to store and process Big Data.

As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.

The technologies associated with big data analytics include Hadoop, MapReduce, and NoSQL.

Hadoop is an open-source framework

Java-based programming framework

Processing and storing of large data sets

Distributed computing environment.

Components of Hadoop: HDFS (Hadoop Distributed File System) and MapReduce

HDFS stores data in a DISTRIBUTED, SCALABLE and FAULT-TOLERANT way.

The NameNode holds metadata about the data on the DataNodes.

The DataNodes hold the actual data in the form of blocks, and they are capable of communicating with one another.
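
The split between metadata on the NameNode and blocks on the DataNodes can be sketched with a toy in-memory model. This is not the HDFS API: the class names, the tiny block size, and the replication factor of 2 are all assumptions chosen for illustration.

```python
# Toy model of the NameNode/DataNode split (not the real HDFS API).
# BLOCK_SIZE and REPLICATION are illustrative assumptions.
BLOCK_SIZE = 4      # bytes per block (real HDFS blocks are far larger)
REPLICATION = 2     # number of copies kept of each block


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block_id -> bytes


class NameNode:
    """Holds only metadata: which blocks form a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.file_blocks = {}       # filename -> [block_id, ...]
        self.block_locations = {}   # block_id -> [DataNode, ...]

    def write(self, filename, data):
        self.file_blocks[filename] = []
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{filename}#{index}"
            # Place the replicas on consecutive nodes, round-robin.
            targets = [self.datanodes[(index + r) % len(self.datanodes)]
                       for r in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.file_blocks[filename].append(block_id)
            self.block_locations[block_id] = targets

    def read(self, filename, failed=()):
        """Reassemble a file, skipping any DataNodes listed as failed."""
        parts = []
        for block_id in self.file_blocks[filename]:
            alive = [node for node in self.block_locations[block_id]
                     if node.name not in failed]
            if not alive:
                raise IOError(f"all replicas of {block_id} are unavailable")
            parts.append(alive[0].blocks[block_id])
        return b"".join(parts)


nodes = [DataNode(f"dn{i}") for i in range(3)]
namenode = NameNode(nodes)
namenode.write("tweets.txt", b"happy days are here")
print(namenode.read("tweets.txt"))                   # all nodes healthy
print(namenode.read("tweets.txt", failed={"dn1"}))   # one node down
```

Because every block lives on two different nodes in this sketch, the second read still succeeds with one DataNode marked as failed.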

Hadoop vs. SQL

Hadoop: data is stored as compressed files across n commodity servers. SQL: data is stored in tables and columns, with relations between them.

Hadoop: fault tolerant; if one node fails, the system still works. SQL: if any one node crashes, it returns an error so as to maintain consistency.

Any questions?

Copying the same file across all (thousands of) nodes? Doesn't that seem like a waste of space?

It is actually not a waste of storage, for 2 reasons:

If one node fails, the system still works, as data is never lost.

The query is spread over the nodes, so it returns results faster thanks to parallel processing.

E.g. select the count of the word ‘happy’ on Twitter. The query is split across multiple servers by a criterion (here, months), and the results are consolidated.
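
The ‘happy’ example can be sketched in a few lines of Python. The sample tweets are made up, and a thread pool merely stands in for the separate servers; a real cluster would run each partition on a different machine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: tweets grouped by month (the split criterion).
tweets_by_month = {
    "jan": ["so happy today", "rainy day"],
    "feb": ["happy happy joy", "not happy at all"],
    "mar": ["nothing to report"],
}


def count_word(tweets, word="happy"):
    """Count occurrences of `word` in one month's tweets (one 'server')."""
    return sum(tweet.split().count(word) for tweet in tweets)


# Each month's partition is counted independently of the others.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_word, tweets_by_month.values()))

# Consolidate the partial results into the final answer.
total = sum(partial_counts)
print(dict(zip(tweets_by_month, partial_counts)), total)
```

The per-month counts never depend on each other, which is what lets the work be split across servers in the first place.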

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.

As in the previous example, the Twitter data was processed on different servers on the basis of months.

Hadoop provides an implementation of MapReduce.

It is a combination of 2 Java functions: Mapper() and Reducer().

Example: checking the popularity of a piece of text using word count.

The Mapper function maps over the split files and provides input to the reducer.

Mapper(filename, file-contents):
    for each word in file-contents:
        emit(word, 1)

The Reducer function combines the input provided by the mapper and produces the output.

Reducer(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
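
The Mapper and Reducer pseudocode above can be turned into a small runnable sketch. This is plain single-process Python, not Hadoop's Java API; the `run_word_count` helper and the input splits are hypothetical, and the grouping step stands in for the shuffle a real framework performs between the two phases.

```python
from collections import defaultdict


def mapper(filename, file_contents):
    """Emit a (word, 1) pair for every word in the file."""
    for word in file_contents.split():
        yield word, 1


def reducer(word, values):
    """Sum all the counts emitted for one word."""
    return word, sum(values)


def run_word_count(files):
    # Shuffle step: group the mapper output by key before reducing.
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, one in mapper(filename, contents):
            grouped[word].append(one)
    return dict(reducer(word, values) for word, values in grouped.items())


# Hypothetical input splits.
files = {"part-0": "happy sad happy", "part-1": "happy calm"}
counts = run_word_count(files)
print(counts)
```

Each call to `mapper` looks at one split only and each call to `reducer` looks at one word only, so both phases could run on many machines at once.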

Can anyone think of any disadvantages?

There were 2 major disadvantages when Hadoop was developed, both of which have since been resolved:

HDFS depended on a single NameNode.

Solution: a secondary NameNode is attached to the primary NameNode.

MapReduce is a Java framework and did not support SQL queries.

Solution: Facebook developed Hive, which allows scientists to work with SQL on a distributed database.

Not only SQL

Non-relational database management system

Used where no fixed schema is required and data is scaled horizontally.

4 categories of NoSQL databases: key-value pair, columnar, graph, and document databases.

KEY-VALUE PAIR

Keys are used to get values from opaque data blocks

Works like a hash map

Tremendously fast

Drawback: no provision for content-based queries.
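
A minimal sketch of the key-value idea, using a Python dict as the hash map. The `put`, `get`, and `find_by_city` helpers and the sample records are hypothetical; values are stored as bytes to mimic opaque blocks the store cannot look inside.

```python
import json

store = {}   # a hash map: key -> opaque bytes


def put(key, obj):
    """Serialize the value; the store itself never inspects its contents."""
    store[key] = json.dumps(obj).encode("utf-8")


def get(key):
    """Look a value up by key: a single hash-map access."""
    return json.loads(store[key].decode("utf-8"))


put("user:1", {"name": "Asha", "city": "Delhi"})
put("user:2", {"name": "Ravi", "city": "Pune"})
print(get("user:1"))


def find_by_city(city):
    # The drawback in practice: with no content-based queries, the client
    # has to fetch and decode every value and filter them itself.
    return [key for key in store if get(key)["city"] == city]


print(find_by_city("Pune"))
```

Lookups by key stay fast however large the store grows, while `find_by_city` has to scan everything.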

DOCUMENT DATABASE

• Again a key-value store, but the value is in the form of a document

• Documents do not have a fixed schema

• Documents can be nested

• Queries can be based on content as well as keys

• Use cases: blogging websites
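
The points above can be illustrated with nested Python dicts standing in for documents. The blog posts and the `find` helper with its dotted-path syntax are assumptions made for this sketch, not the API of any particular document database.

```python
# Hypothetical blog posts: note that the two documents have different fields.
posts = {
    "post:1": {
        "title": "Intro to Hadoop",
        "author": {"name": "Asha"},
        "tags": ["hadoop", "hdfs"],
    },
    "post:2": {
        "title": "NoSQL basics",
        "author": {"name": "Ravi"},
        "comments": [{"by": "Asha", "text": "Nice post"}],
    },
}

_MISSING = object()   # marks a field that a document does not have


def find(collection, path, value):
    """Return keys of documents whose field at dotted `path` equals `value`."""
    matches = []
    for key, doc in collection.items():
        current = doc
        for part in path.split("."):
            if isinstance(current, dict) and part in current:
                current = current[part]
            else:
                current = _MISSING
                break
        if current is not _MISSING and current == value:
            matches.append(key)
    return matches


print(posts["post:1"]["title"])             # query by key
print(find(posts, "author.name", "Ravi"))   # query by nested content
```

Unlike a plain key-value store, the database understands the structure of each value, so it can answer the content query without the client decoding anything.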

COLUMNAR DATABASE

Works on attributes (columns) rather than tuples (rows)

The key is the column name and the value is that column's values, stored contiguously

Best for aggregation queries

Typical query: select (1 or 2 columns' values) where (the same or another column's value) = some value.
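
A small sketch of the column-oriented layout, with made-up sales data. Each column name maps to a contiguous list of values, so an aggregation only has to touch the columns it names.

```python
# Row-oriented view of some hypothetical sales data.
rows = [
    {"city": "Delhi", "product": "pen", "amount": 10},
    {"city": "Pune", "product": "book", "amount": 40},
    {"city": "Delhi", "product": "book", "amount": 25},
]

# Column-oriented view: column name -> contiguous list of that column's values.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# "select sum(amount) where city = 'Delhi'" reads only two of the three
# columns; the product column is never touched.
delhi_total = sum(
    amount
    for city, amount in zip(columns["city"], columns["amount"])
    if city == "Delhi"
)
print(columns["amount"], delhi_total)
```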

GRAPH DATABASES

• A collection of nodes and edges

• Nodes represent data, while edges represent the links between them

• Most dynamic and flexible
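
As a rough sketch, a graph can be held as a dict of nodes plus an adjacency list of edges. The users, the "follows" relationship, and the `reachable` helper are invented for illustration and do not reflect any specific graph database.

```python
from collections import deque

# Nodes carry the data; edges carry the links between nodes.
nodes = {
    "asha": {"kind": "user"},
    "ravi": {"kind": "user"},
    "meera": {"kind": "user"},
}
edges = {"asha": ["ravi"], "ravi": ["meera"], "meera": []}   # "follows"


def reachable(start):
    """Breadth-first traversal: every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in edges.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


print(sorted(reachable("asha")))

# Flexibility: a new relationship is just one more edge, with no schema change.
edges["meera"].append("asha")
print(sorted(reachable("meera")))
```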

Data is the new oil

Without big data analytics, companies are deaf and dumb, mere wanderers on the web... like cattle on the highway!

Thank you! Keep dreaming BIG :D