thinking big with big data

Post on 21-Apr-2017

Thinking Big
An Introduction to Big Data

About Me
- Shawn Hermans
- Data Engineer/Scientist
- Technology consultant
- Physics, math, data geek

About this Talk
- Non-technical introduction to Big Data
- Not focused on any technology or platform
- Focus on concepts

Should you believe the hype?

Big Data Promises
- No need for scientific method
- Predict disease outbreaks before the CDC
- Cure cancer
- Innovating healthcare
- Solve world hunger
- Bring about world peace

https://twitter.com/BigDataBorat/status/349293502498213888

Big Data Criticism
- Garbage in, garbage out
- Ignores the role of the scientific method
- Lots of questions don't require large amounts of data to get good stats
- Privacy issues

Big Data is just another way to think about data

Mental Models
"A mental model is simply a representation of an external reality inside your head. Mental models are concerned with understanding knowledge about the world." - Farnam Street Blog

Examples
- Occam's razor
- Mind maps
- Law of supply and demand
- Never get in a land war in Asia

All models are wrong, but some are useful

Quote by George E. P. Box - http://en.wikiquote.org/wiki/George_E._P._Box

Relational Resistance
Resistance to big data concepts, technologies, and techniques because of the belief that the relational model is the only way to think about data.

See also: Theory-induced blindness

See http://www.bloomberg.com/news/2011-10-25/bias-blindness-and-how-we-truly-think-part-2-daniel-kahneman.html

Data Mental Models
- Relational
- Linked
- Object Oriented
- Geospatial
- Temporal
- Semantic
- Event Based
- Data as Code
- Bayesian
- Unstructured

What is Big Data?

According to Gartner
"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

According to Me
Big data is the Bazaar to traditional data's Cathedral

Inspired by Eric Raymond's The Cathedral and the Bazaar - http://www.catb.org/esr/writings/cathedral-bazaar/introduction/

Cathedral and Bazaar
Traditional Data
- Clean
- Top down
- Carefully collected
- Scales vertically
- One true way
Big Data
- Disorderly
- Bottom up
- Randomly collected
- Scales horizontally
- More than one way

Big Data Differences
Relational
- Normalization
- ACID
- SQL/Query
- Structured/Schema
Big Data
- Denormalization
- BASE
- MapReduce/Other
- Loosely Structured
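To make the "SQL/Query" versus "MapReduce/Other" contrast concrete, here is a minimal single-process sketch of the MapReduce pattern, counting words much as a SQL `GROUP BY` with `COUNT(*)` would. The function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample records are illustrative assumptions, not part of the talk or of any particular framework; a real platform would spread these phases across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical loosely structured input: free text, no schema.
records = ["big data is big", "data is messy"]
counts = map_reduce(records)
print(counts)
```

Because each call to `map_phase` looks at one record only, and each call to `reduce_phase` looks at one key only, both phases can in principle run in parallel, which is what lets this style of processing scale horizontally.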

BASE (basically available, soft state, eventual consistency). See the CAP theorem for more details: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
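As a rough illustration of what "eventual consistency" can mean in practice, the sketch below models two replicas that accept writes independently and only agree after they exchange state. The `Replica` class, the integer timestamps, and the last-write-wins rule are simplifying assumptions made for this example; real systems use a range of conflict-resolution strategies.

```python
class Replica:
    """One copy of a key-value store that accepts writes locally."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        """Keep the value only if it is newer than what we hold (last write wins)."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        """Return the locally known value, which may be stale."""
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Pull every entry from another replica and merge it in."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a = Replica("a")
b = Replica("b")

# Both replicas stay available for writes, so they can temporarily disagree.
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)
stale = a.read("cart")  # replica a has not seen the newer write yet

# After the replicas exchange state, they converge on the same value.
a.sync_from(b)
b.sync_from(a)
converged = a.read("cart") == b.read("cart")
print(stale, a.read("cart"), converged)
```

The trade-off on display is the one the slide contrasts with ACID: reads are always answered, but for a while they may return an older value.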

Integrating all available data is the promise of Big Data

Why should you care?

Big data might not save the world, but it could entertain us

http://www.fastcodesign.com/1671893/the-secret-sauce-behind-netflixs-hit-house-of-cards-big-data

Information as an Asset
- Target specific customer's needs rather than broad segments
- Just-in-time inventory management
- Evaluating demand for product
- Predict and track traffic patterns

http://blogs.wsj.com/digits/2014/01/17/amazon-wants-to-ship-your-package-before-you-buy-it/
http://en.wikipedia.org/wiki/Google_Traffic#Crowdsourced_traffic_data

Big Data and You
- What information do you have that no one else has?
- Can you easily integrate your data or is it locked in silos?
- What data don't you collect?
- What data don't you archive?

"Big Data and You" sounds like a good children's book title.

Big Data Technology

Big Data Platforms
Cloud
- AWS
- Google
- Microsoft

Hadoop
- Cloudera
- MapR
- Hortonworks
This isn't an all-inclusive list, but a sample of the big players in the space.

Big Data Stack
- Batch Processing
- Data Collection
- SQL/Query
- Search
- Machine Learning
- Serialization
- Security
- Stream Processing
- File Storage
- Resource management
- Online NoSQL
- Data Pipeline

This is the admin screen for Amazon Web Services. Not all of these services are Big Data services, but it gives you a good idea of an integrated Big Data platform.

What about data science?

What IS Data Science?
- Data science is statistics on a Mac
- A data scientist is a statistician who lives in San Francisco
- Person who is better at statistics than any software engineer and better at software engineering than any statistician.

https://twitter.com/cdixon/status/428914681911070720
https://twitter.com/BigDataBorat/status/372350993255518208

https://twitter.com/josh_wills/status/198093512149958656

Although use of the term "data science" has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced "business analytics" in contexts such as graduate degree programs.[13] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, "I think data-scientist is a sexed up term for a statistician.... Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn't berate the term statistician."[14]

From Drew Conway http://en.wikipedia.org/wiki/Data_science#mediaviewer/File:Data_Science_Venn_Diagram.png

The need for Data Science
- There is a LOT of data
- Too much data for people to look at it all
- Probabilistic models help extract signal from the noise
- Need to automate the analysis and exploitation of data

Big Data has its limits

Black Swans and Big Data
- There are fundamental limits to prediction
- Hard to predict rare events where no prior data exists (i.e. Black Swans)
- Complex systems often have feedback loops (e.g. stock market)

See Nassim Taleb's excellent essay "The Fourth Quadrant" - http://edge.org/conversation/the-fourth-quadrant-a-map-of-the-limits-of-statistics

What's next?

Getting Started
Business
- Identify some unresolved questions
- Figure out what data could answer those questions
- Pick the easiest and test out your hypothesis
Technology
- Pick a technology you know or want to learn
- Pick a platform
- Pick a data set and identify some basic problems to solve

See http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public for datasets

My Info
- Twitter: @shawnhermans
- Github: github.com/shawnhermans
- Blog: http://shawnhermans.github.io/ (In Progress)
- Slideshare: www.slideshare.net/shawnhermans/
- Quora: http://www.quora.com/Shawn-Hermans

Backup Slides

http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting

The Fourth Quadrant and the Failure of Statistics

Soothsayer
- Simple HTTP/JSON API for training/classifying data
- Lots of built-in classifier statistics

https://github.com/shawnhermans/soothsayer