Thinking Big: An Introduction to Big Data

Upload: shawn-hermans

Post on 11-Aug-2014

Category:

Data & Analytics


DESCRIPTION

A non-technical introduction to Big Data that conveys the core concepts and ideas of big data without giving in to the hype.

TRANSCRIPT

Page 1: Thinking Big with Big Data

Thinking Big
An Introduction to Big Data

Page 2: Thinking Big with Big Data

About Me
Shawn Hermans
● Data Engineer/Scientist
● Technology consultant
● Physics, math, data geek

Page 3: Thinking Big with Big Data

About this Talk
● Non-technical introduction to Big Data
● Not focused on any technology or platform
● Focus on concepts

Page 4: Thinking Big with Big Data

Should you believe the hype?

Page 6: Thinking Big with Big Data
Page 7: Thinking Big with Big Data

Big Data Criticism
● Garbage in, Garbage out
● Ignores the role of the scientific method
● Lots of questions don’t require large amounts of data to get good stats
● Privacy issues
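
The point about not needing large amounts of data can be illustrated with a small arithmetic sketch. It assumes a simple random sample, a 95% confidence level (z = 1.96), and the normal approximation for a proportion; the function name `margin_of_error` is made up for this illustration and is not from the talk.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sampled proportion.

    Assumes a simple random sample of size n and the normal
    approximation; p=0.5 is the worst case (widest interval).
    """
    return z * math.sqrt(p * (1 - p) / n)


# Print how the margin shrinks as the sample grows.
for n in (100, 1_000, 10_000, 1_000_000):
    print(f"n={n:>9,}: +/- {margin_of_error(n):.2%}")
```

Under these assumptions a sample of 1,000 already gives a margin of roughly 3 percentage points, and each further tenfold increase in sample size only shrinks the margin by a factor of about 3.2 (the square root of 10).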

Page 8: Thinking Big with Big Data

Big Data is just another way to think about data

Page 9: Thinking Big with Big Data

Mental Models
“A mental model is simply a representation of an external reality inside your head. Mental models are concerned with understanding knowledge about the world.”

- Farnam Street Blog

Page 10: Thinking Big with Big Data

Examples
● Occam's razor
● Mind maps
● Law of supply and demand
● Never get in a land war in Asia

Page 11: Thinking Big with Big Data

All models are wrong, but some are useful

Page 12: Thinking Big with Big Data

Relational Resistance
Resistance to big data concepts, technologies, and techniques because of the belief that the relational model is the only way to think about data.

See also: Theory-induced blindness

Page 13: Thinking Big with Big Data
Page 14: Thinking Big with Big Data

Data Mental Models
● Relational
● Linked
● Object Oriented
● Geospatial
● Temporal
● Semantic
● Event Based
● Data as Code
● Bayesian
● Unstructured

Page 15: Thinking Big with Big Data

What is Big Data?

Page 16: Thinking Big with Big Data

“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

According to Gartner

Page 17: Thinking Big with Big Data

According to Me

Big data is the Bazaar to traditional data’s Cathedral

Page 18: Thinking Big with Big Data

Cathedral and Bazaar

Traditional Data
● Clean
● Top down
● Carefully collected
● Scales vertically
● One true way

Big Data
● Disorderly
● Bottom up
● Randomly collected
● Scales horizontally
● More than one way

Page 19: Thinking Big with Big Data

Big Data Differences

Relational
● Normalization
● ACID
● SQL/Query
● Structured/Schema

Big Data
● Denormalization
● BASE
● MapReduce/Other
● Loosely Structured
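
For readers who have not seen the MapReduce style mentioned above, here is a minimal single-machine sketch of the idea in Python. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample records are assumptions made for this example, not part of the talk or of any particular platform. A real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an iterable of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical loosely structured input: free text, no schema.
records = ["big data is big", "data is data"]
counts = word_count(records)
print(counts)
```

The roughly equivalent relational approach would first load the words into a table and then run a `GROUP BY` query; the MapReduce style instead works directly on loosely structured records.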

Page 20: Thinking Big with Big Data

Integrating all available data is the promise of Big Data

Page 21: Thinking Big with Big Data

Why should you care?

Page 22: Thinking Big with Big Data
Page 23: Thinking Big with Big Data

Information as an Asset
● Target specific customers' needs rather than broad segments
● Just-in-time inventory management
● Evaluating demand for product
● Predict and track traffic patterns

Page 24: Thinking Big with Big Data

Big Data and You
● What information do you have, that no one else has?
● Can you easily integrate your data or is it locked in silos?
● What data don’t you collect?
● What data don’t you archive?

Page 25: Thinking Big with Big Data

Big Data Technology

Page 26: Thinking Big with Big Data

Big Data Platforms

Cloud
● AWS
● Google
● Microsoft

Hadoop
● Cloudera
● MapR
● Hortonworks

This isn’t an all-inclusive list, but a sample of the big players in the space.

Page 27: Thinking Big with Big Data

Big Data Stack
● Batch Processing
● Data Collection
● SQL/Query
● Search
● Machine Learning
● Serialization
● Security
● Stream Processing
● File Storage
● Resource management
● Online NoSQL
● Data Pipeline

Page 28: Thinking Big with Big Data
Page 29: Thinking Big with Big Data

What about data science?

Page 30: Thinking Big with Big Data

What IS Data Science?
● Data science is statistics on a Mac
● A data scientist is a statistician who lives in San Francisco
● Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Page 31: Thinking Big with Big Data
Page 32: Thinking Big with Big Data

The need for Data Science
● There is a LOT of data
● Too much data for people to look at it all
● Probabilistic models help extract signal from the noise
● Need to automate the analysis and exploitation of data

Page 33: Thinking Big with Big Data

Big Data has its limits

Page 34: Thinking Big with Big Data

Black Swans and Big Data
● There are fundamental limits to prediction
● Hard to predict rare events where no prior data exists (i.e. Black Swans)
● Complex systems often have feedback loops (e.g. stock market)

Page 35: Thinking Big with Big Data

What’s next?

Page 36: Thinking Big with Big Data

Getting Started

Business
● Identify some unresolved questions
● Figure out what data could answer those questions
● Pick the easiest and test out your hypothesis

Technology
● Pick a technology you know or want to learn
● Pick a platform
● Pick a data set and identify some basic problems to solve

Page 37: Thinking Big with Big Data

My Info
Twitter: @shawnhermans
Github: github.com/shawnhermans
Blog: http://shawnhermans.github.io/ (In Progress)
Slideshare: www.slideshare.net/shawnhermans/
Quora: http://www.quora.com/Shawn-Hermans

Page 38: Thinking Big with Big Data

Backup Slides

Page 39: Thinking Big with Big Data
Page 40: Thinking Big with Big Data

The Fourth Quadrant and the Failure of Statistics

Page 41: Thinking Big with Big Data

Soothsayer
● Simple HTTP/JSON API for training/classifying data
● Lots of built-in classifier statistics

https://github.com/shawnhermans/soothsayer

Page 42: Thinking Big with Big Data