thinking big with big data
Post on 21-Apr-2017
Thinking Big
An Introduction to Big Data
About Me
Shawn Hermans
Data Engineer/Scientist
Technology consultant
Physics, math, data geek
About this Talk
Non-technical introduction to Big Data
Not focused on any technology or platform
Focus on concepts
Should you believe the hype?
Big Data Promises
No need for scientific method
Predict disease outbreaks before the CDC
Cure cancer
Innovating healthcare
Solve world hunger
Bring about world peace
https://twitter.com/BigDataBorat/status/349293502498213888
Big Data Criticism
Garbage in, garbage out
Ignores the role of the scientific method
Lots of questions don't require large amounts of data to get good stats
Privacy issues
Big Data is just another way to think about data
Mental Models
"A mental model is simply a representation of an external reality inside your head. Mental models are concerned with understanding knowledge about the world." - Farnam Street Blog
Examples
Occam's razor
Mind maps
Law of supply and demand
Never get in a land war in Asia
All models are wrong, but some are useful
Quote by George E. P. Box: http://en.wikiquote.org/wiki/George_E._P._Box
Relational Resistance
Resistance to big data concepts, technologies, and techniques because of the belief that the relational model is the only way to think about data.
See also: Theory induced blindness
See http://www.bloomberg.com/news/2011-10-25/bias-blindness-and-how-we-truly-think-part-2-daniel-kahneman.html
Data Mental Models
Relational
Linked
Object Oriented
Geospatial
Temporal
Semantic
Event Based
Data as Code
Bayesian
Unstructured
What is Big Data?
According to Gartner
Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
According to Me
Big data is the Bazaar to traditional data's Cathedral
Inspired by Eric Raymond's The Cathedral and the Bazaar - http://www.catb.org/esr/writings/cathedral-bazaar/introduction/
Cathedral and Bazaar
Traditional Data: Clean, Top down, Carefully collected, Scales vertically, One true way
Big Data: Disorderly, Bottom up, Randomly collected, Scales horizontally, More than one way
Big Data Differences
Relational: Normalization, ACID, SQL/Query, Structured/Schema
Big Data: Denormalization, BASE, MapReduce/Other, Loosely Structured
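For readers who have not seen MapReduce, the following is a minimal single-machine sketch of the idea in plain Python, not code for Hadoop or any specific platform. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample records are made up for illustration. A real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on a list of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: each string stands in for one record in a large file.
records = ["big data is big", "data is data"]
counts = word_count(records)
print(counts)
```

Because each record is mapped independently and each key is reduced independently, the work can be split across machines, which is what lets this style of processing scale horizontally.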
BASE (basically available, soft state, eventual consistency). See the CAP theorem for more details: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
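Eventual consistency can be illustrated with a toy example. The sketch below assumes a hypothetical `Replica` class and a last-write-wins rule based on a caller-supplied timestamp. That rule is only one of several conflict-resolution strategies and is not taken from any particular database.

```python
class Replica:
    """A toy key-value replica; every value is stored with a timestamp."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        """Accept the write only if it is newer than what is already held."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        """Return the locally known value, which may be stale."""
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Pull the other replica's entries; the newest timestamp wins."""
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)

stale = a.read("cart")  # replica a has not yet seen the newer write
a.sync_from(b)
b.sync_from(a)
converged = a.read("cart") == b.read("cart")
print(stale, converged)
```

Both replicas stay available for reads and writes the whole time. The cost is that a read before the sync can return an older value, which is the trade-off BASE accepts and ACID does not.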
Integrating all available data is the promise of Big Data
Why should you care?
Big data might not save the world, but it could entertain us
http://www.fastcodesign.com/1671893/the-secret-sauce-behind-netflixs-hit-house-of-cards-big-data
Information as an Asset
Target specific customers' needs rather than broad segments
Just-in-time inventory management
Evaluating demand for product
Predict and track traffic patterns
http://blogs.wsj.com/digits/2014/01/17/amazon-wants-to-ship-your-package-before-you-buy-it/
http://en.wikipedia.org/wiki/Google_Traffic#Crowdsourced_traffic_data
Big Data and You
What information do you have that no one else has?
Can you easily integrate your data or is it locked in silos?
What data don't you collect?
What data don't you archive?
"Big Data and You" sounds like a good children's book title.
Big Data Technology
Big Data Platforms
Cloud: AWS, Google, Microsoft
Hadoop: Cloudera, MapR, Hortonworks
This isn't an all-inclusive list, but a sample of the big players in the space.
Big Data Stack
Batch Processing
Data Collection
SQL/Query
Search
Machine Learning
Serialization
Security
Stream Processing
File Storage
Resource management
Online NoSQL
Data Pipeline
This is the admin screen for Amazon Web Services. Not all of these services are Big Data, but it gives you a good idea of an integrated Big Data platform.
What about data science?
What IS Data Science?
Data science is statistics on a Mac
A data scientist is a statistician who lives in San Francisco
Person who is better at statistics than any software engineer and better at software engineering than any statistician.
https://twitter.com/cdixon/status/428914681911070720
https://twitter.com/BigDataBorat/status/372350993255518208
https://twitter.com/josh_wills/status/198093512149958656
Although use of the term "data science" has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced "business analytics" in contexts such as graduate degree programs.[13] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, "I think data-scientist is a sexed up term for a statistician.... Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn't berate the term statistician."[14]
From Drew Conway http://en.wikipedia.org/wiki/Data_science#mediaviewer/File:Data_Science_Venn_Diagram.png
The need for Data Science
There is a LOT of data
Too much data for people to look at it all
Probabilistic models help extract signal from the noise
Need to automate the analysis and exploitation of data
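As a small illustration of how a probabilistic model separates signal from noise, the sketch below applies Bayes' rule to a noisy alert. The `posterior` function name and all of the rates are hypothetical numbers chosen for the example, not figures from the talk.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability the event is real given that an alert fired."""
    evidence = (true_positive_rate * prior
                + false_positive_rate * (1 - prior))
    return true_positive_rate * prior / evidence


# Assumed numbers: the event occurs 1% of the time, the detector catches
# 90% of real events, and it fires on 5% of non-events.
p = posterior(prior=0.01, true_positive_rate=0.9, false_positive_rate=0.05)
print(round(p, 3))
```

With these assumed rates, fewer than one in six alerts points at a real event, even though the detector looks accurate on paper. Automating this kind of calculation across a large stream of data is the sort of work the slide describes.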
Big Data has its limits
Black Swans and Big Data
There are fundamental limits to prediction
Hard to predict rare events where no prior data exists (i.e., Black Swans)
Complex systems often have feedback loops (e.g., stock market)
See Nassim Taleb's excellent essay "The Fourth Quadrant" - http://edge.org/conversation/the-fourth-quadrant-a-map-of-the-limits-of-statistics
What's next?
Getting Started
Business:
Identify some unresolved questions
Figure out what data could answer those questions
Pick the easiest and test out your hypothesis
Technology:
Pick a technology you know or want to learn
Pick a platform
Pick a data set and identify some basic problems to solve
See http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public for datasets
My Info
Twitter: @shawnhermans
Github: github.com/shawnhermans
Blog: http://shawnhermans.github.io/ (In Progress)
Slideshare: www.slideshare.net/shawnhermans/
Quora: http://www.quora.com/Shawn-Hermans
Backup Slides
http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
The Fourth Quadrant and the Failure of Statistics
Soothsayer
Simple HTTP/JSON API for training/classifying data
Lots of built-in classifier statistics
https://github.com/shawnhermans/soothsayer